the surveyselect procedure

49
Chapter 63 The SURVEYSELECT Procedure Chapter Table of Contents OVERVIEW ................................... 3275 GETTING STARTED .............................. 3276 Simple Random Sampling ........................... 3277 Stratified Sampling ............................... 3279 Stratified Sampling with Control Sorting .................... 3282 SYNTAX ..................................... 3283 PROC SURVEYSELECT Statement ...................... 3284 CONTROL Statement ............................. 3294 ID Statement .................................. 3295 SIZE Statement ................................. 3295 STRATA Statement ............................... 3295 DETAILS ..................................... 3296 Missing Values ................................. 3296 Sorting by CONTROL Variables ........................ 3296 Sample Selection Methods ........................... 3297 Simple Random Sampling .......................... 3298 Unrestricted Random Sampling ....................... 3298 Systematic Random Sampling ....................... 3299 Sequential Random Sampling ........................ 3299 PPS Sampling without Replacement .................... 3300 PPS Sampling with Replacement ...................... 3302 PPS Systematic Sampling .......................... 3302 PPS Sequential Sampling .......................... 3303 Brewer’s PPS Method ............................ 3304 Murthy’s PPS Method ............................ 3305 Sampford’s PPS Method .......................... 3306 Output Data Set ................................. 3306 Displayed Output ................................ 3309 ODS Table Names ............................... 3310 EXAMPLES ................................... 3310 Example 63.1 Replicated Sampling ....................... 3310 Example 63.2 PPS Selection of Two Units Per Stratum ............ 3313

Upload: lyanh

Post on 01-Jan-2017

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The SURVEYSELECT Procedure

Chapter 63The SURVEYSELECT Procedure

Chapter Table of Contents

OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3275

GETTING STARTED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3276Simple Random Sampling. . . . . . . . . . . . . . . . . . . . . . . . . . .3277Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3279Stratified Sampling with Control Sorting. . . . . . . . . . . . . . . . . . . .3282

SYNTAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3283PROC SURVEYSELECT Statement. . . . . . . . . . . . . . . . . . . . . .3284CONTROL Statement . .. . . . . . . . . . . . . . . . . . . . . . . . . . .3294ID Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3295SIZE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3295STRATA Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3295

DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3296Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3296Sorting by CONTROL Variables . .. . . . . . . . . . . . . . . . . . . . . .3296Sample Selection Methods. . . . . . . . . . . . . . . . . . . . . . . . . . .3297

Simple Random Sampling. . . . . . . . . . . . . . . . . . . . . . . . . .3298Unrestricted Random Sampling .. . . . . . . . . . . . . . . . . . . . . .3298Systematic Random Sampling .. . . . . . . . . . . . . . . . . . . . . .3299Sequential Random Sampling . .. . . . . . . . . . . . . . . . . . . . . .3299PPS Sampling without Replacement. . . . . . . . . . . . . . . . . . . .3300PPS Sampling with Replacement. . . . . . . . . . . . . . . . . . . . . .3302PPS Systematic Sampling . . . . . . . . . . . . . . . . . . . . . . . . . .3302PPS Sequential Sampling . . . . . . . . . . . . . . . . . . . . . . . . . .3303Brewer’s PPS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . .3304Murthy’s PPS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . .3305Sampford’s PPS Method . . . . . . . . . . . . . . . . . . . . . . . . . .3306

Output Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3306Displayed Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3309ODS Table Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3310

EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3310Example 63.1 Replicated Sampling . . . . . . . . . . . . . . . . . . . . . . .3310Example 63.2 PPS Selection of Two Units Per Stratum . . . . . . . . . . . .3313

Page 2: The SURVEYSELECT Procedure

3274 � Chapter 63. The SURVEYSELECT Procedure

Example 63.3 PPS (Dollar-Unit) Sampling . . . . . . . . . . . . . . . . . . .3315

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3319

SAS OnlineDoc: Version 8

Page 3: The SURVEYSELECT Procedure

Chapter 63The SURVEYSELECT Procedure

Overview

The SURVEYSELECT procedure provides a variety of methods for selectingprobability-based random samples. The procedure can select a simple random sampleor a sample according to a complex multistage sample design that includes stratifi-cation, clustering, and unequal probabilities of selection. With probability sampling,each unit in the survey population has a known, positive probability of selection.This property of probability sampling avoids selection bias and enables you to usestatistical theory to make valid inferences from the sample to the survey population.

To select a sample with PROC SURVEYSELECT, you input a SAS data set that con-tains the sampling frame, or list of units from which the sample is to be selected. Youalso specify the selection method, the desired sample size or sampling rate, and otherselection parameters. The SURVEYSELECT procedure selects the sample, produc-ing an output data set that contains the selected units, their selection probabilities, andsampling weights. When you are selecting a sample in multiple stages, you invokethe procedure separately for each stage of selection, inputting the frame and selectionparameters for each current stage.

The SURVEYSELECT procedure provides methods for both equal probability sam-pling and probability proportional to size (PPS) sampling. In equal probability sam-pling, each unit in the sampling frame, or in a stratum, has the same probability ofbeing selected for the sample. In PPS sampling, a unit’s selection probability is pro-portional to its size measure. For details on probability sampling methods, refer toKish (1987, 1965), Kalton (1983), and Cochran (1977).

The SURVEYSELECT procedure provides the following equal probability samplingmethods:

� simple random sampling

� unrestricted random sampling (with replacement)

� systematic random sampling

� sequential random sampling

This procedure also provides the following probability proportional to size (PPS)methods:

� PPS sampling without replacement

� PPS sampling with replacement

Page 4: The SURVEYSELECT Procedure

3276 � Chapter 63. The SURVEYSELECT Procedure

� PPS systematic sampling

� PPS algorithms for selecting two units per stratum

� sequential PPS sampling with minimum replacement

The procedure uses fast, efficient algorithms for these sample selection methods.Thus, it performs well even for very large input data sets or sampling frames, whichmay occur in practice for large-scale sample surveys.

The SURVEYSELECT procedure can perform stratified sampling, selecting samplesindependently within the specified strata, or nonoverlapping subgroups of the surveypopulation. Stratification controls the distribution of the sample size in the strata.It is widely used in practice towards meeting a variety of survey objectives. Forexample, with stratification you can ensure adequate sample sizes for subgroups ofinterest, including small subgroups, or you can use stratification towards improvingthe precision of the overall estimates. When you are using a systematic or sequentialselection method, the SURVEYSELECT procedure also can sort by control variableswithin strata for the additional control of implicit stratification.

The SURVEYSELECT procedure provides replicated sampling, where the total sam-ple is composed of a set of replicates, each selected in the same way. You can usereplicated sampling to study variable nonsampling errors, such as variability in theresults obtained by different interviewers. You can also use replication to computestandard errors for the combined sample estimates.

Getting Started

In this example, an Internet service provider wants to conduct a customer satisfactionsurvey. The survey population consists of the company’s current subscribers. Thecompany plans to select a sample of customers from this population, interview theselected customers, and then make inferences about the entire survey population fromthe sample data.

The SAS data setCustomers contains the sampling frame, which is the list of unitsin the survey population. The sample of customers will be selected from this sam-pling frame. The data setCustomers is constructed from the company’s customerdatabase. It contains one observation for each customer, with a total of 13,471 obser-vations. Figure 63.1 displays the first ten observations of the data setCustomers.

SAS OnlineDoc: Version 8

Page 5: The SURVEYSELECT Procedure

Simple Random Sampling � 3277

Internet Service Provider Customers(First 10 Observations)

Obs CustomerID State Type Usage

1 416-87-4322 AL New 8392 288-13-9763 GA Old 2243 339-00-8654 GA Old 24514 118-98-0542 GA New 3495 421-67-0342 FL New 5626 623-18-9201 SC New 687 324-55-0324 FL Old 1378 832-90-2397 AL Old 15639 586-45-0178 GA New 615

10 801-24-5317 SC New 728

Figure 63.1. Customers Data Set (First 10 Observations)

In the SAS data setCustomers, the variableCustomerID uniquely identifies eachcustomer. The variableState contains the state of the customer’s address. Thecompany has customers in the following four states: Georgia (GA), Alabama (AL),Florida (FL), and South Carolina (SC). The variableType equals ‘Old’ if the cus-tomer has subscribed to the service for more than one year; otherwise, the variableType equals ‘New’. The variableUsage contains the customer’s average monthlyservice usage, in minutes.

The following sections illustrate the use of PROC SURVEYSELECT for probabilitysampling with three different designs for the customer satisfaction survey. All threedesigns are one stage, with customers as the sampling units. The first design is simplerandom sampling without stratification. In the second design, customers are stratifiedby state and type, and the sample is selected by simple random sampling within strata.In the third design, customers are sorted within strata by usage, and the sample isselected by systematic random sampling within strata.

Simple Random Sampling

The following PROC SURVEYSELECT statements select a probability sample ofcustomers from theCustomers data set using simple random sampling.

title1 ’Customer Satisfaction Survey’;proc surveyselect data=Customers method=srs n=100

out=SampleSRS;run;

The PROC SURVEYSELECT statement invokes the procedure. The DATA= optionnames the SAS data setCustomers as the input data set from which to select thesample. The METHOD=SRS option specifies simple random sampling as the sampleselection method. In simple random sampling, each unit has an equal probabilityof selection, and sampling is without replacement. Without-replacement samplingmeans that a unit cannot be selected more than once. The N=100 option specifies asample size of 100 customers. The OUT= option stores the sample in the SAS dataset namedSampleSRS.

SAS OnlineDoc: Version 8

Page 6: The SURVEYSELECT Procedure

3278 � Chapter 63. The SURVEYSELECT Procedure

Figure 63.2 displays the output from PROC SURVEYSELECT, which summarizesthe sample selection. A sample of 100 customers is selected from the data setCustomers by simple random sampling. With simple random sampling and no strat-ification in the sample design, the selection probability is the same for all units in thesample. In this sample, the selection probability for each customer equals 0.007423,which is the sample size (100) divided by the population size (13,471). The sam-pling weight equals 134.71 for each customer in the sample, where the weight isthe inverse of the selection probability. If you specify the STATS option, PROCSURVEYSELECT includes the selection probabilities and sampling weights in theoutput data set. (This information is always included in the output data set for morecomplex designs.)

The random number seed is 39647. PROC SURVEYSELECT uses this number as theinitial seed for random number generation. Since the SEED= option is not specifiedin the PROC SURVEYSELECT statement, the seed value is obtained using the timeof day from the computer’s clock. You can specify SEED=39647 to reproduce thissample.

Customer Satisfaction Survey

The SURVEYSELECT Procedure

Selection Method Simple Random Sampling

Input Data Set CUSTOMERSRandom Number Seed 39647Sample Size 100Selection Probability 0.007423Sampling Weight 134.71Output Data Set SAMPLESRS

Figure 63.2. Sample Selection Summary

The sample of 100 customers is stored in the SAS data setSampleSRS. PROCSURVEYSELECT does not display this output data set. The following PROC PRINTstatements display the first 20 observations ofSampleSRS.

title1 ’Customer Satisfaction Survey’;title2 ’Sample of 100 Customers, Selected by SRS’;title3 ’(First 20 Observations)’;proc print data=SampleSRS(obs=20);run;

Figure 63.3 displays the first 20 observations of the output data setSampleSRS,which contains the sample of customers. This data set includes all the variables fromthe DATA= input data setCustomers. If you do not want to include all variables,you can use the ID statement to specify which variables to copy from the input dataset to the output (sample) data set.

SAS OnlineDoc: Version 8

Page 7: The SURVEYSELECT Procedure

Stratified Sampling � 3279

Customer Satisfaction SurveySample of 100 Customers, Selected by SRS

(First 20 Observations)

Obs CustomerID State Type Usage

1 036-89-0212 FL New 742 045-53-3676 AL New 4113 050-99-2380 GA Old 1674 066-93-5368 AL Old 12325 082-99-9234 FL New 906 097-17-4766 FL Old 1317 110-73-1051 FL Old 1028 111-91-6424 GA New 2479 127-39-4594 GA New 61

10 162-50-3866 FL New 10011 162-56-1370 FL New 22412 167-21-6808 SC New 6013 168-02-5189 AL Old 755314 174-07-8711 FL New 28415 187-03-7510 SC New 2116 190-78-5019 GA New 18517 200-75-0054 GA New 22418 201-14-1003 GA Old 343719 207-15-7701 GA Old 2420 211-14-1373 AL Old 88

Figure 63.3. Customer Sample (First 20 Observations)

Stratified Sampling

In this section, stratification is added to the sample design for the customer satisfac-tion survey. The sampling frame, or list of all customers, is stratified byState andType. This divides the sampling frame into nonoverlapping subgroups formed fromthe values of theState andType variables. Samples are then selected independentlywithin the strata.

PROC SURVEYSELECT requires that the input data set be sorted by the STRATAvariables. The following PROC SORT statements sort theCustomers data set by thestratification variablesState andType.

proc sort data=Customers;by State Type;

run;

The following PROC FREQ statements display the crosstabulation of theCustomersdata set byState andType.

proc freq data=Customers;tables State*Type;

run;

Figure 63.4 presents the table ofState by Type for the 13,471 customers. There arefour states and two levels ofType, forming a total of eight strata.

SAS OnlineDoc: Version 8

Page 8: The SURVEYSELECT Procedure

3280 � Chapter 63. The SURVEYSELECT Procedure

The FREQ Procedure

Table of State by Type

State Type

Frequency|Percent |Row Pct |Col Pct |New |Old | Total---------+--------+--------+AL | 1238 | 706 | 1944

| 9.19 | 5.24 | 14.43| 63.68 | 36.32 || 14.43 | 14.43 |

---------+--------+--------+FL | 2170 | 1370 | 3540

| 16.11 | 10.17 | 26.28| 61.30 | 38.70 || 25.29 | 28.01 |

---------+--------+--------+GA | 3488 | 1940 | 5428

| 25.89 | 14.40 | 40.29| 64.26 | 35.74 || 40.65 | 39.66 |

---------+--------+--------+SC | 1684 | 875 | 2559

| 12.50 | 6.50 | 19.00| 65.81 | 34.19 || 19.63 | 17.89 |

---------+--------+--------+Total 8580 4891 13471

63.69 36.31 100.00

Figure 63.4. Stratification of Customers by State and Type

The following PROC SURVEYSELECT statements select a probability sample ofcustomers from theCustomers data set according to the stratified sample design.

title1 ’Customer Satisfaction Survey’;title2 ’Stratified Sampling’;proc surveyselect data=Customers method=srs n=15

seed=1953 out=SampleStrata;strata State Type;

run;

The STRATA statement names the stratification variablesState and Type. In thePROC SURVEYSELECT statement, the METHOD=SRS option specifies simplerandom sampling. The N=15 option specifies a sample size of 15 customers for eachstratum. If you want to specify different sample sizes for different strata, you canuse the N=SAS-data-setoption to name a secondary data set that contains the stratumsample sizes. The SEED=1953 option specifies ’1953’ as the initial seed for randomnumber generation.

Figure 63.5 displays the output from PROC SURVEYSELECT, which summarizesthe sample selection. A total of 120 customers are selected.

SAS OnlineDoc: Version 8

Page 9: The SURVEYSELECT Procedure

Stratified Sampling � 3281

Customer Satisfaction SurveyStratified Sampling

The SURVEYSELECT Procedure

Selection Method Simple Random SamplingStrata Variables State

Type

Input Data Set CUSTOMERSRandom Number Seed 1953Stratum Sample Size 15Number of Strata 8Total Sample Size 120Output Data Set SAMPLESTRATA

Figure 63.5. Sample Selection Summary

The following PROC PRINT statements display the first 30 observations of the outputdata setSampleStrata.

title1 ’Customer Satisfaction Survey’;title2 ’Sample Selected by Stratified Design’;title3 ’(First 30 Observations)’;proc print data=SampleStrata(obs=30);run;

Figure 63.6 displays the first 30 observations of the output data setSampleStrata,which contains the sample of 120 customers, 15 customers from each of the 8 strata.The variableSelectionProb contains the selection probability for each customer inthe sample. Since customers are selected with equal probability within strata in thisdesign, the selection probability equals the stratum sample size (15) divided by thestratum population size. The selection probabilities differ from stratum to stratumsince the population sizes differ. The selection probability for each customer in thefirst stratum (State=‘AL’ and Type=‘New’) is 0.012116, and the selection probabilityis 0.021246 for customers in the second stratum. The variableSamplingWeightcontains the sampling weights, which are computed as inverse selection probabilities.

SAS OnlineDoc: Version 8

Page 10: The SURVEYSELECT Procedure

3282 � Chapter 63. The SURVEYSELECT Procedure

Customer Satisfaction SurveySample Selected by Stratified Design

(First 30 Observations)

Selection SamplingObs State Type CustomerID Usage Prob Weight

1 AL New 002-26-1498 1189 0.012116 82.53332 AL New 070-86-8494 106 0.012116 82.53333 AL New 121-28-6895 76 0.012116 82.53334 AL New 131-79-7630 265 0.012116 82.53335 AL New 211-88-4991 108 0.012116 82.53336 AL New 222-81-3742 83 0.012116 82.53337 AL New 238-46-3776 278 0.012116 82.53338 AL New 370-01-0671 123 0.012116 82.53339 AL New 407-07-5479 1580 0.012116 82.5333

10 AL New 550-90-3188 177 0.012116 82.533311 AL New 582-40-9610 46 0.012116 82.533312 AL New 672-59-9114 66 0.012116 82.533313 AL New 848-60-3119 28 0.012116 82.533314 AL New 886-83-4909 170 0.012116 82.533315 AL New 993-31-7677 64 0.012116 82.533316 AL Old 124-60-0495 80 0.021246 47.066717 AL Old 128-54-9590 56 0.021246 47.066718 AL Old 204-05-4017 17 0.021246 47.066719 AL Old 210-68-8704 4363 0.021246 47.066720 AL Old 239-75-4343 430 0.021246 47.066721 AL Old 317-70-6496 452 0.021246 47.066722 AL Old 365-37-1340 21 0.021246 47.066723 AL Old 399-78-7900 108 0.021246 47.066724 AL Old 404-90-6273 824 0.021246 47.066725 AL Old 421-04-8548 1332 0.021246 47.066726 AL Old 604-48-0587 16 0.021246 47.066727 AL Old 774-04-0162 318 0.021246 47.066728 AL Old 849-66-4156 79 0.021246 47.066729 AL Old 937-69-9106 182 0.021246 47.066730 AL Old 985-09-8691 24 0.021246 47.0667

Figure 63.6. Customer Sample (First 30 Observations)

Stratified Sampling with Control Sorting

The next sample design for the customer satisfaction survey uses stratification byState. The sampling frame is also sorted byType andUsage before sample selec-tion, to provide additional control over the distribution of the sample. Customers arethen selected by systematic random sampling within strata. The following PROCSURVEYSELECT statements select a probability sample of customers from theCustomers data set using this design.

title1 ’Customer Satisfaction Survey’;title2 ’Stratified Sampling with Control Sorting’;proc surveyselect data=Customers method=sys seed=1234

rate=.02 out=SampleControl;strata State;control Type Usage;

run;

The STRATA statement names the stratification variableState. The CONTROLstatement names the control variablesType and Usage. In the PROC

SAS OnlineDoc: Version 8

Page 11: The SURVEYSELECT Procedure

Syntax � 3283

SURVEYSELECT statement, the METHOD=SYS option requests systematic ran-dom sampling. The SEED=1234 option specifies the initial seed for random numbergeneration. The RATE=.02 option specifies a sampling rate of 2% for each stratum.

Figure 63.7 displays the output from PROC SURVEYSELECT, which summarizesthe sample selection. A sample of 271 customers is selected, using systematic ran-dom sampling within strata determined byState. The sampling frameCustomersis sorted by control variablesType andUsage within strata. The type of sorting isserpentine, which is used by default since SORT=NEST is not specified. See the sec-tion “Sorting by CONTROL Variables” on page 3296 for a description of serpentinesorting. The output data setSampleControl contains the sample of customers.

Customer Satisfaction SurveyStratified Sampling with Control Sorting

The SURVEYSELECT Procedure

Selection Method Systematic Random SamplingStrata Variable StateControl Variables Type

UsageControl Sorting Serpentine

Input Data Set CUSTOMERSRandom Number Seed 1234Stratum Sampling Rate 0.02Number of Strata 4Total Sample Size 271Output Data Set SAMPLECONTROL

Figure 63.7. Sample Selection Summary

Syntax

The following statements are available in PROC SURVEYSELECT.

PROC SURVEYSELECT options ;STRATA variables ;CONTROL variables ;SIZE variable ;ID variables ;

The PROC SURVEYSELECT statement invokes the procedure and optionally identi-fies input and output data sets. It also specifies the selection method, the sample size,and other sample design parameters. The SURVEYSELECT statement is required.

The SIZE statement identifies the variable that contains the size measures. It is re-quired for any selection method that is probability proportional to size (PPS).

The remaining statements are optional. The STRATA statement identifies a variableor set of variables that stratify the input data set. When you specify a STRATAstatement, PROC SURVEYSELECT selects samples independently from the strataformed by the STRATA variables. The CONTROL statement identifies variables for

SAS OnlineDoc: Version 8

Page 12: The SURVEYSELECT Procedure

3284 � Chapter 63. The SURVEYSELECT Procedure

ordering units within strata. It can be used for systematic and sequential samplingmethods. The ID statement identifies variables to copy from the input data set to theoutput data set of selected units.

The rest of this section gives detailed syntax information for the CONTROL, ID,SIZE, and STRATA statements in alphabetical order after the description of thePROC SURVEYSELECT statement.

PROC SURVEYSELECT Statement

PROC SURVEYSELECT options ;

The PROC SURVEYSELECT statement invokes the procedure and optionally iden-tifies input and output data sets. If you do not name a DATA= input data set, theprocedure selects the sample from the most recently created SAS data set. If you donot name an OUT= output data set to contain the sample of selected units, the proce-dure still creates an output data set and names it according to the DATAn convention.

The PROC SURVEYSELECT statement also specifies the sample selection method,the sample size, and other sample design parameters. If you do not specify a selectionmethod, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS)if there is no SIZE statement. If you specify a SIZE statement but do not specifya selection method, PROC SURVEYSELECT uses probability proportional to sizeselection without replacement (METHOD=PPS). You must specify the sample sizeor sampling rate unless you request a method that selects two units from each stratum(METHOD=PPS–BREWER or METHOD=PPS–MURTHY).

You can use the SAMPSIZE=n option to specify the sample size, or you can use theSAMPSIZE=SAS-data-setoption to name a secondary input data set that containsstratum sample sizes. You can also specify stratum sampling rates, minimum sizemeasures, maximum size measures, and certainty size measures in the secondaryinput data set. See the descriptions of the SAMPSIZE=, SAMPRATE=, MINSIZE=,MAXSIZE=, and CERTSIZE= options. You can name only one secondary input dataset in each invocation of the procedure.

The following table lists the options available with the PROC SURVEYSELECTstatement. Descriptions follow in alphabetical order.

SAS OnlineDoc: Version 8

Page 13: The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement � 3285

Table 63.1. PROC SURVEYSELECT Statement Options

Task OptionsSpecify the input data set DATA=

Specify output data sets OUT=OUTSORT=

Suppress displayed output NOPRINTSpecify selection method METHOD=Specify sample size SAMPSIZE=

Specify sampling rate SAMPRATE=NMIN=NMAX=

Specify number of replicates REP=

Adjust size measures MINSIZE=MAXSIZE=

Specify certainty size measures CERTSIZE=Specify sorting type SORT=Specify random number seed SEED=

Control OUT= contents JTPROBSOUTHITSOUTSIZESTATS

You can specify the following options in the PROC SURVEYSELECT statement.

CERTSIZErequests automatic selection of those units with size measures greater than or equal tothe stratum certainty size measures, which you provide in the secondary input data setvariable–CERTSIZE– . Use the CERTSIZE option when you have already namedthe secondary input data set in another option, such as SAMPSIZE=SAS-data-set,SAMPRATE=SAS-data-set, MAXSIZE=SAS-data-set, or MINSIZE=SAS-data-set.You can name only one secondary input data set in each invocation of the procedure.

If any size measure is greater than or equal to the certainty size measure for its stra-tum, then PROC SURVEYSELECT selects this unit with certainty. Each certaintysize measure must be a positive number. The CERTSIZE option is available forMETHOD=PPS and METHOD=PPS–SAMPFORD.

If you want to specify a single certainty size measure in the PROC SURVEYSELECTstatement, use the CERTSIZE=certainoption.

CERTSIZE=certainspecifies the certainty size measure. PROC SURVEYSELECT selects with certaintyany unit with size measure greater than or equal to the valuecertain. The certaintysize measure must be a positive number. This option is available for METHOD=PPSand METHOD=PPS–SAMPFORD.

If you request a stratified sample design with a STRATA statement and specify theCERTSIZE= option, PROC SURVEYSELECT uses the certainty sizecertain for

SAS OnlineDoc: Version 8

Page 14: The SURVEYSELECT Procedure

3286 � Chapter 63. The SURVEYSELECT Procedure

all strata. If you do not want to use the same certainty size for all strata, use theCERTSIZE=SAS-data-setoption to specify a certainty size for each stratum.

CERTSIZE=SAS-data-setnames a SAS data set that contains the certainty size measures for the strata. PROCSURVEYSELECT selects with certainy any unit with size measure greater thanor equal to the certainty size measure for its stratum. Each certainty size mea-sure must be a positive number. This option is available for METHOD=PPS andMETHOD=PPS–SAMPFORD.

The CERTSIZE= input data set should contain all the STRATA variables, with thesame type and length as in the DATA= data set. The STRATA groups should ap-pear in the same order in the CERTSIZE= data set as in the DATA= data set. TheCERTSIZE= data set should have a variable named–CERTSIZE– that contains thecertainty size measure for each stratum.

CERTSIZE=P=pspecifies the certainty proportion. PROC SURVEYSELECT selects with certaintyany unit with size measure greater than or equal to the proportionpof the total size forall units in the stratum. The procedure repeats this process with the remaining unitsuntil no more certainty units are selected. This option is available for METHOD=PPSand METHOD=PPS–SAMPFORD.

The certainty proportion must be a positive number. You can specifyp as a numberbetween 0 and 1. Or you can specifyp in percentage form as a number between 1and 100, and PROC SURVEYSELECT converts that number to a proportion. Theprocedure treats the value 1 as 100%, and not the percentage form 1%.

If you request a stratified sample design with a STRATA statement and specify theCERTSIZE=P= option, PROC SURVEYSELECT uses the same certainty proportionp for all strata.

DATA=SAS-data-setnames the SAS data set from which PROC SURVEYSELECT selects the sample. Ifyou omit the DATA= option, the procedure uses the most recently created SAS dataset. In sampling terminology, the input data set is thesampling frame, or list of unitsfrom which the sample is selected.

JTPROBSincludes joint probabilities of selection in the OUT= output data set. This op-tion is available for the following probability proportional to size selection meth-ods: METHOD=PPS, METHOD=PPS–SAMPFORD, and METHOD=PPS–WR.By default, PROC SURVEYSELECT outputs joint selection probabilities forMETHOD=PPS–BREWER and METHOD=PPS–MURTHY, which select two unitsper stratum. For more information on the contents of the output data set, see thesection “Output Data Set” on page 3306.

MAXSIZErequests size measure adjustment by stratum maximum size measures, whichyou provide in the secondary input data set variable–MAXSIZE– . Use theMAXSIZE option when you have already named the secondary input data set

SAS OnlineDoc: Version 8

Page 15: The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement � 3287

in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set,MINSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one sec-ondary input data set in each invocation of the procedure.

If any size measure exceeds the maximum size measure for its stratum, then PROCSURVEYSELECT adjusts this size measure downward to equal the maximum sizemeasure. Each maximum size measure must be a positive number. The MAXSIZEoption is available whenever you specify a SIZE statement for probability propor-tional to size selection and a STRATA statement for stratification.

If you want to specify a single maximum size value in the PROC SURVEYSELECTstatement, use the MAXSIZE=maxoption.

MAXSIZE=maxspecifies the maximum allowable size measure. If any size measure exceeds the valuemax, then PROC SURVEYSELECT adjusts this size measure to equalmax. Themaximum size measure must be a positive number. This option is available wheneveryou specify a SIZE statement for selection with probability proportional to size.

If you request a stratified sample design with a STRATA statement and specify theMAXSIZE= option, PROC SURVEYSELECT uses the maximum sizemax for allstrata. If you do not want to use the same maximum size for all strata, use theMAXSIZE=SAS-data-setoption to specify a maximum size for each stratum.

MAXSIZE=SAS-data-setnames a SAS data set that contains the maximum allowable size measures for thestrata. If any size measure exceeds the maximum size measure for its stratum, thenPROC SURVEYSELECT adjusts this size measure downward to equal the maximumsize measure. Each maximum size measure must be a positive number. This option isavailable whenever you specify a SIZE statement for probability proportional to sizeselection and a STRATA statement for stratified selection.

The MAXSIZE= input data set should contain all the STRATA variables, with thesame type and length as in the DATA= data set. The STRATA groups should appear inthe same order in the MAXSIZE= data set as in the DATA= data set. The MAXSIZE=data set should have a variable named–MAXSIZE– that contains the maximum sizemeasure for each stratum.

METHOD=nameM=name

specifies the method for sample selection. If you do not specify the METHOD=option, by default PROC SURVEYSELECT uses simple random sampling(METHOD=SRS) if there is no SIZE statement. If you specify a SIZE statement,the default selection method is probability proportional to size without replacement(METHOD=PPS). Valid values fornameare as follows:

PPS requests selection with probability proportional to size andwithout replacement. See the section “PPS Sampling with-out Replacement” on page 3300 for details. If you specifyMETHOD=PPS, you must name the size measure variable inthe SIZE statement.

SAS OnlineDoc: Version 8

Page 16: The SURVEYSELECT Procedure

3288 � Chapter 63. The SURVEYSELECT Procedure

PPS–BREWER | BREWER requests selection according to Brewer’s method.Brewer’s method selects two units from each stratum with proba-bility proportional to size and without replacement. See the section“Brewer’s PPS Method” on page 3304 for details. If you spec-ify METHOD=PPS–BREWER, you must name the size measurevariable in the SIZE statement. You do not need to specify thesample size with the SAMPSIZE= option, since Brewer’s methodselects two units from each stratum.

PPS–MURTHY | MURTHY requests selection according to Murthy’s method.Murthy’s method selects two units from each stratum with proba-bility proportional to size and without replacement. See the section“Murthy’s PPS Method” on page 3305 for details. If you spec-ify METHOD=PPS–MURTHY, you must name the size measurevariable in the SIZE statement. You do not need to specify thesample size with the SAMPSIZE= option, since Murthy’s methodselects two units from each stratum.

PPS–SAMPFORD | SAMPFORD requests selection according to Sampford’smethod. Sampford’s method selects units with probability pro-portional to size and without replacement. See the section “Samp-ford’s PPS Method” on page 3306 for details. If you specifyMETHOD=PPS–SAMPFORD, you must name the size measurevariable in the SIZE statement.

PPS–SEQ | CHROMY requests sequential selection with probability proportional tosize and with minimum replacement. This method is also known asChromy’s method. See the section “PPS Sequential Sampling” onpage 3303 for details. If you specify METHOD=PPS–SEQ, youmust name the size measure variable in the SIZE statement.

PPS–SYS requests systematic selection with probability proportional to size.See the section “PPS Systematic Sampling” on page 3302 for de-tails on this method. If you specify METHOD=PPS–SYS, youmust name the size measure variable in the SIZE statement.

PPS–WR requests selection with probability proportional to size and withreplacement. See the section “PPS Sampling with Replacement”on page 3302 for details on this method. If you specifyMETHOD=PPS–WR, you must name the size measure variable inthe SIZE statement.

SEQ requests sequential selection according to Chromy’s method. Ifyou specify METHOD=SEQ and do not specify a size measurewith the SIZE statement, PROC SURVEYSELECT uses sequentialzoned selection with equal probability and without replacement.See the section “Sequential Random Sampling” on page 3299for details on this method. If you specify METHOD=SEQand also name a size measure in the SIZE statement, PROCSURVEYSELECT uses METHOD=PPS–SEQ, which is sequen-tial selection with probability proportional to size and with mini-mum replacement.

SAS OnlineDoc: Version 8

Page 17: The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement � 3289

See the section “PPS Sequential Sampling” on page 3303 for de-tails on this method.

SRS requests simple random sampling, which is selection with equalprobability and without replacement. See the section “Simple Ran-dom Sampling” on page 3298 for details. This method is the de-fault if you do not specify the METHOD= option and also do notspecify a SIZE statement.

SYS requests systematic random sampling. If you specifyMETHOD=SYS and do not specify a size measure with theSIZE statement, PROC SURVEYSELECT uses systematic selec-tion with equal probability. See the section “Systematic RandomSampling” on page 3299 for details on this method. If you spec-ify METHOD=SYS and also name a size measure in the SIZEstatement, PROC SURVEYSELECT uses METHOD=PPS–SYS,which is systematic selection with probability proportional to size.See the section “PPS Systematic Sampling” on page 3302 fordetails.

URS requests unrestricted random sampling, which is selection withequal probability and with replacement. See the section “Unre-stricted Random Sampling” on page 3298 for details.

MINSIZErequests size measure adjustment by the stratum minimum size measures, whichyou provide in the secondary input data set variable–MINSIZE– . Use theMINSIZE option when you have already named the secondary input data setin another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set,MAXSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one sec-ondary input data set in each invocation of the procedure.

If any size measure is less than the minimum size measure for its stratum, then PROCSURVEYSELECT adjusts this size measure upward to equal the minimum size mea-sure. Each minimum size measure must be a positive number. The MINSIZE optionis available whenever you specify a SIZE statement for probability proportional tosize selection and a STRATA statement for stratification.

If you want to specify a single minimum size value in the PROC SURVEYSELECTstatement, use the MINSIZE=minoption.

MINSIZE=minspecifies the minimum allowable size measure. If any size measure is less than thevaluemin, then PROC SURVEYSELECT adjusts this size measure upward to equalmin. The minimum size measure must be a positive number. This option is availablewhenever you specify a SIZE statement for selection with probability proportional tosize.

If you request a stratified sample design with a STRATA statement and specify theMINSIZE= option, PROC SURVEYSELECT uses the minimum sizemin for allstrata. If you do not want to use the same minimum size for all strata, use theMINSIZE=SAS-data-setoption to specify a minimum size for each stratum.

SAS OnlineDoc: Version 8

Page 18: The SURVEYSELECT Procedure

3290 � Chapter 63. The SURVEYSELECT Procedure

MINSIZE=SAS-data-setnames a SAS data set that contains the minimum allowable size measures for thestrata. If any size measure is less than the minimum size measure for its stratum, thenPROC SURVEYSELECT adjusts this size measure upward to equal the minimumsize measure. Each minimum size measure must be a positive number. This option isavailable whenever you specify a SIZE statement for probability proportional to sizeselection and a STRATA statement for stratified selection.

The MINSIZE= input data set should contain all the STRATA variables, with thesame type and length as in the DATA= data set. The STRATA groups should appear inthe same order in the MINSIZE= data set as in the DATA= data set. The MINSIZE=data set should have a variable named–MINSIZE– that contains the minimum sizemeasure for each stratum.

NMAX=nspecifies the maximum stratum sample sizen for the SAMPRATE= option. Whenyou specify the SAMPRATE= option, PROC SURVEYSELECT calculates the de-sired stratum sample size from the specified sampling rate and the total number ofunits in the stratum. If this sample size is greater than the value NMAX=n, thenPROC SURVEYSELECT selects the maximum ofn units.

The maximum sample sizen must be a positive integer. The NMAX= option isavailable only with the SAMPRATE= option, which may be used with equal prob-ability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, andMETHOD=SEQ).

NMIN=nspecifies the minimum stratum sample sizen for the SAMPRATE= option. Whenyou specify the SAMPRATE= option, PROC SURVEYSELECT calculates the de-sired stratum sample size from the specified sampling rate and the total number ofunits in the stratum. If this sample size is less than the value NMIN=n, then PROCSURVEYSELECT selects the minimum ofn units.

The minimum sample sizen must be a positive integer. The NMIN= option isavailable only with the SAMPRATE= option, which may be used with equal prob-ability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, andMETHOD=SEQ).

NOPRINTsuppresses the display of all output. You can use the NOPRINT option when youwant only to create an output data set. Note that this option temporarily disables theOutput Delivery System (ODS). For more information, see Chapter 15, “Using theOutput Delivery System.”

OUT=SAS-data-setnames the output data set that contains the sample. If you omit the OUT= option,the data set is named DATAn, wheren is the smallest integer that makes the nameunique.

The output data set contains the units selected for the sample, as well as design infor-mation and selection statistics, depending on the selection method and output options

SAS OnlineDoc: Version 8

Page 19: The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement � 3291

you specify. See the descriptions for the options JTPROBS, OUTHITS, OUTSIZE,and STATS. For information on the contents of the output data set, see the section“Output Data Set” on page 3306.

OUTHITSincludes a separate observation in the output data set for each selection when the sameunit is selected more than once. By default, the output data set contains only one ob-servation for each selected unit, even if it is selected more than once, and the variableNumberHits contains the number of hits or selections for that unit. The OUTHITSoption is available for selection methods that select with replacement or with min-imum replacement (METHOD=URS, METHOD=PPS–WR, METHOD=PPS–SYS,and METHOD=PPS–SEQ).

OUTSIZEincludes additional design and sampling frame parameters in the output data set. Ifyou specify the OUTSIZE option, PROC SURVEYSELECT includes the sample sizeor sampling rate in the output data set. When you request the OUTSIZE option andalso specify the SIZE statement, the procedure outputs the size measure total for thesampling frame. If you do not specify the SIZE statement, the procedure outputs thetotal number of sampling units in the frame. Also, PROC SURVEYSELECT includesthe minimum size measure if you specify the MINSIZE= option, the maximum sizemeasure if you specify the MAXSIZE= option, and the certainty size measure if youspecify the CERTSIZE= option.

If you have a stratified design, the output data set includes the stratum-level values ofthese parameters. Otherwise, the output data set includes the overall population-levelvalues.

For information on the contents of the output data set, see the section “Output DataSet” on page 3306.

OUTSORT=SAS-data-setnames an output data set that contains the sorted input data set. This option isavailable when you specify a CONTROL statement for systematic or sequentialselection methods (METHOD=SYS, METHOD=PPS–SYS, METHOD=SEQ, andMETHOD=PPS–SEQ). PROC SURVEYSELECT sorts the input data set by theCONTROL variables within strata before selecting the sample.

If you specify CONTROL variables but do not name an output data set with theOUTSORT= option, then the sorted data set replaces the input data set.

REP=nrepspecifies the number of sample replicates. If you specify the REP= option, PROCSURVEYSELECT selectsnrep independent samples, each with the same specifiedsample size or sampling rate and the same sample design.

You can use replicated sampling to provide a simple method of variance estimationfor any form of statistic, as well as to evaluate variable nonsampling errors such asinterviewer differences. Refer to Kish (1965), Kish (1987), and Kalton (1983) forinformation on replicated sampling.

SAS OnlineDoc: Version 8

Page 20: The SURVEYSELECT Procedure

3292 � Chapter 63. The SURVEYSELECT Procedure

SAMPRATE= rRATE=r

specifies the sampling rate, which is the proportion of units selected for the sam-ple. The sampling rater must be a positive number. You can specifyr as a numberbetween 0 and 1. Or you can specifyr in percentage form as a number between 1and 100, and PROC SURVEYSELECT converts that number to a proportion. Theprocedure treats the value 1 as 100%, and not the percentage form 1%.

The SAMPRATE= option is available only for equal probability selection meth-ods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). Forsystematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses theinverse of the sampling rater as the interval. See the section “Systematic RandomSampling” on page 3299 for details. For other selection methods, PROC SURVEYS-ELECT converts the sampling rater to the sample size before selection, multiplyingthe rate by the number of units in the stratum or frame and rounding up to the nearestinteger.

If you request a stratified sample design with a STRATA statement and specify theSAMPRATE=r option, PROC SURVEYSELECT uses the sampling rater for eachstratum. If you do not want to use the same sampling rate for each stratum, use theSAMPRATE=(values) option or the SAMPRATE=SAS-data-setoption to specify asampling rate for each stratum.

SAMPRATE=(values)RATE=(values)

specifies sampling rates for the strata. You can separatevalueswith blanks or com-mas. The number of SAMPRATE= values must equal the number of strata in theinput data set.

List the stratum sampling rate values in the order in which the strata appear in theinput data set. If you use the SAMPRATE=(values) option, the input data set must besorted by the STRATA variables in ascending order. You cannot use the DESCEND-ING or NOTSORTED options in the STRATA statement.

Each stratum sampling rate value must be a positive number. You can specify eachvalue as a number between 0 and 1. Or you can specify a value in percentage form asa number between 1 and 100, and PROC SURVEYSELECT converts that number toa proportion. The procedure treats the value 1 as 100%, and not the percentage form1%.

The SAMPRATE= option is available only for equal probability selection meth-ods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). Forsystematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses theinverse of the stratum sampling rate as the interval for the stratum. See the section“Systematic Random Sampling” on page 3299 for details on systematic sampling.For other selection methods, PROC SURVEYSELECT converts the stratum samplingrate to a stratum sample size before selection, multiplying the rate by the number ofunits in the stratum and rounding up to the nearest integer.

SAS OnlineDoc: Version 8

Page 21: The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement � 3293

SAMPRATE=SAS-data-setRATE=SAS-data-set

names a SAS data set that contains sampling rates for the strata. This input dataset should contain all the STRATA variables, with the same type and length as inthe DATA= data set. The STRATA groups should appear in the same order in theSAMPSIZE= data set as in the DATA= data set. The SAMPRATE= data set shouldhave a variable–RATE– that contains the sampling rate for each stratum.

Each sampling rate value must be a positive number. You can specify each valueas a number between 0 and 1. Or you can specify a value in percentage form as anumber between 1 and 100, and PROC SURVEYSELECT converts that number to aproportion. The procedure treats the value 1 as 100%, and not the percentage form1%.

The SAMPRATE= option is available only for equal probability selection meth-ods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). Forsystematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses theinverse of the stratum sampling rate as the interval for the stratum. See the sec-tion “Systematic Random Sampling” on page 3299 for details. For other selectionmethods, PROC SURVEYSELECT converts the stratum sampling rate to the stra-tum sample size before selection, multiplying the rate by the number of units in thestratum and rounding up to the nearest integer.

SAMPSIZE=nN=n

specifies the sample size, which is the number of units selected for the sample. Thesample sizen must be a positive integer. For methods that select without replacement,the sample sizen must not exceed the number of units in the input data set.

If you request a stratified sample design with a STRATA statement and specifythe SAMPSIZE=n option, PROC SURVEYSELECT selectsn units from each stra-tum. For methods that select without replacement, the sample sizen must not ex-ceed the number of units in any stratum. If you do not want to select the samenumber of units from each stratum, use the SAMPSIZE=(values) option or theSAMPSIZE=SAS-data-setoption to specify different sample sizes for the strata.

SAMPSIZE=(values)N=(values)

specifies sample sizes for the strata. You can separatevalueswith blanks or commas.The number of SAMPSIZE= values must equal the number of strata in the input dataset.

List the stratum sample size values in the order in which the strata appear in the inputdata set. If you use the SAMPSIZE=(values) option, the input data set must be sortedby the STRATA variables in ascending order. You cannot use the DESCENDING orNOTSORTED options in the STRATA statement.

Each stratum sample size value must be a positive integer. For methods that selectwithout replacement, the sample size for a stratum must not exceed the number ofunits in that stratum.

SAS OnlineDoc: Version 8

Page 22: The SURVEYSELECT Procedure

3294 � Chapter 63. The SURVEYSELECT Procedure

SAMPSIZE=SAS-data-setN=SAS-data-set

names a SAS data set that contains the sample sizes for the strata. This input dataset should contain all the STRATA variables, with the same type and length as in theDATA= data set. The STRATA groups should appear in the same order in the SAMP-SIZE= data set as in the DATA= data set. The SAMPSIZE= data set should have avariable–NSIZE– that contains the sample size for each stratum. Each sample sizevalue must be a positive integer. For methods that select without replacement, thestratum sample size must not exceed the number of units in the stratum.

SEED=numberspecifies the initial seed for random number generation. The value of the SEED=option must be a positive integer. If you do not specify the SEED= option, PROCSURVEYSELECT uses the time of day from the computer’s clock to obtain the initialseed.

SORT=NEST | SERPspecifies the type of sorting by CONTROL variables. The option SORT=NESTrequests nested sorting, and SORT=SERP requests hierarchic serpentine sorting.The default is SORT=SERP. See the section “Sorting by CONTROL Variables” onpage 3296 for descriptions of serpentine and nested sorting. Where there is only oneCONTROL variable, the two types of sorting are equivalent.

This option is available when you specify a CONTROL statement for system-atic or sequential selection methods (METHOD=SYS, METHOD=PPS–SYS,METHOD=SEQ, and METHOD=PPS–SEQ). PROC SURVEYSELECT sorts theinput data set by the CONTROL variables within strata before selecting the sample.

STATSincludes selection probabilities and sampling weights in the OUT= output data setfor equal probability selection methods when you do not specify a STRATA state-ment. This option is available for the folowing equal probability selection methods:METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ. For PPSselection methods and stratified designs, the output data set contains selection proba-bilities and sampling weights by default. For more information on the contents of theoutput data set, see the section “Output Data Set” on page 3306.

CONTROL Statement

CONTROL variables ;

The CONTROL statement names variables for sorting the input data set. The CON-TROL variables can be character or numeric.

PROC SURVEYSELECT sorts the input data set by the CONTROL variablesbefore selecting the sample. If you also specify a STRATA statement, PROCSURVEYSELECT sorts by CONTROL variables within strata. Control sorting is

SAS OnlineDoc: Version 8

Page 23: The SURVEYSELECT Procedure

STRATA Statement � 3295

available for systematic and sequential selection methods (METHOD=SYS,METHOD=PPS–SYS, METHOD=SEQ, and METHOD=PPS–SEQ).

By default, PROC SURVEYSELECT uses hierarchic serpentine sorting by the CON-TROL variables. If you specify the SORT=NEST option, the procedure uses nestedsorting. See the description for the SORT= option. For more information on ser-pentine and nested sorting, see the section “Sorting by CONTROL Variables” onpage 3296.

You can use the OUTSORT= option to name an output data set that contains thesorted input data set. If you do not specify the OUTSORT= option when you use theCONTROL statement, then the sorted data set replaces the input data set.

ID Statement

ID variables ;

The ID statement names variables from the DATA= input data set to be includedin the OUT= data set of selected units. If there is no ID statement, PROCSURVEYSELECT includes all variables from the DATA= data set in the OUT= dataset. The ID variables can be character or numeric.

SIZE Statement

SIZE variable ;

The SIZE statement names one and only one size measure variable, which containsthe size measures to be used when sampling with probability proportional to size. TheSIZE variable must be numeric. When the value of an observation’s SIZE variableis missing or nonpositive, that observation has no chance of being selected for thesample.

The SIZE statement is required for all PPS selection methods, which in-clude METHOD=PPS, METHOD=PPS–BREWER, METHOD=PPS–MURTHYMETHOD=PPS–SAMPFORD, METHOD=PPS–SEQ, METHOD=PPS–SYS, andMETHOD=PPS–WR.

STRATA Statement

STRATA variables ;

You can specify a STRATA statement with PROC SURVEYSELECT to partitionthe input data set into nonoverlapping groups defined by the STRATA variables.PROC SURVEYSELECT then selects independent samples from these strata, ac-cording to the selection method and design parameters specified in the PROCSURVEYSELECT statement.

SAS OnlineDoc: Version 8

Page 24: The SURVEYSELECT Procedure

3296 � Chapter 63. The SURVEYSELECT Procedure

For information on the use of stratification in sample design, refer to Kalton (1983),Kish (1987), and Cochran (1977).

Thevariablesare one or more variables in the input data set. The STRATA variablesfunction much like BY variables, and PROC SURVEYSELECT expects the inputdata set to be sorted in order of the STRATA variables.

If you specify a CONTROL statement, or if you specify METHOD=PPS, the inputdata set must be sorted in ascending order of the STRATA variables. So you can-not use the STRATA option NOTSORTED or DESCENDING when you specify aCONTROL statement or METHOD=PPS.

If your input data set is not sorted by the STRATA variables in ascending order, useone of the following alternatives:

� Sort the data using the SORT procedure with the STRATA variables in a BYstatement.

� Specify the option NOTSORTED or DESCENDING in the STRATA statementfor the SURVEYSELECT procedure (when you do not specify a CONTROLstatement or METHOD=PPS). The NOTSORTED option does not mean thatthe data are unsorted but rather that the data are arranged in groups (accordingto values of the STRATA variables) and that these groups are not necessarilyin alphabetical or increasing numeric order.

� Create an index on the STRATA variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion inSAS LanguageReference: Concepts. For more information on the DATASETS procedure, refer tothe discussion in theSAS Procedures Guide.

Details

Missing Values

If an observation has a missing or nonpositive value for the SIZE variable, PROCSURVEYSELECT excludes that observation from the sample selection. The proce-dure writes a note to the log giving the number of observations omitted due to missingor nonpositive size measures.

PROC SURVEYSELECT treats missing STRATA variable values like any otherSTRATA variable value. The missing values form a separate stratum.

If a value of–NSIZE– is missing in the SAMPSIZE= input data set, then PROCSURVEYSELECT writes an error message to the log and does not select a samplefrom that stratum. The procedure treats missing values of–NRATE– , –MINSIZE– ,

–MAXSIZE– , and–CERTSIZE– similarly.

Sorting by CONTROL Variables

If you specify a CONTROL statement, PROC SURVEYSELECT sorts the input dataset by the CONTROL variables before selecting the sample. If you also specify a

SAS OnlineDoc: Version 8

Page 25: The SURVEYSELECT Procedure

Sample Selection Methods � 3297

STRATA statement, the procedure sorts by CONTROL variables within strata. Sort-ing by CONTROL variables is available for systematic and sequential selection meth-ods, which include METHOD=SYS, METHOD=PPS–SYS, METHOD=SEQ, andMETHOD=PPS–SEQ. Sorting provides additional control over the distribution ofthe sample, giving some benefits of proportionate stratification.

By default, the sorted data set replaces the input data set. Or you can use theOUTSORT= option to name an output data set that contains the sorted input dataset.

PROC SURVEYSELECT provides two types of sorting, nested sorting and hierarchicserpentine sorting. If you specify the SORT=NEST option, then the procedure sortsby the CONTROL variables according to nested sorting. If you do not specify theSORT=NEST option, the procedure uses serpentine sorting by default. These twotypes of sorting are equivalent when there is only one CONTROL variable.

If you request nested sorting, PROC SURVEYSELECT sorts observations in thesame order as PROC SORT does for an ascending sort by the CONTROL vari-ables. Refer to the chapter on the SORT procedure in theSAS Procedures Guide.PROC SURVEYSELECT sorts within strata if you also specify a STRATA state-ment. The procedure first arranges the input observations in ascending order of thefirst CONTROL variable. Then within each level of the first control variable, theprocedure arranges the observations in ascending order of the second CONTROLvariable. This continues for all CONTROL variables specified.

In hierarchic serpentine sorting, PROC SURVEYSELECT sorts by the firstCONTROL variable in ascending order. Then within the first level of the firstCONTROL variable, the procedure sorts by the second CONTROL variable in as-cending order. Within the second level of the first CONTROL variable, the proceduresorts by the second CONTROL variable in descending order. Sorting by the secondCONTROL variable continues to alternate between ascending and descending sortingthroughout all levels of the first CONTROL variable. If there is a third CONTROLvariable, the procedure sorts by that variable within levels formed from the first twoCONTROL variables, again alternating between ascending and descending sorting.This continues for all CONTROL variables specified. This sorting algorithm mini-mizes the change from one observation to the next with respect to the CONTROLvariable values, thus making nearby observations more similar. For more informationon serpentine sorting, refer to Chromy (1979) and Williams and Chromy (1980).

Sample Selection Methods

PROC SURVEYSELECT provides a variety of methods for selecting probability-based random samples. With probability sampling, each unit in the survey populationhas a known, positive probablity of selection. This property of probability samplingavoids selection bias and enables you to use statistical theory to make valid inferencesfrom the sample to the survey population. Refer to Kish (1987, 1965), Kalton (1983),and Cochran (1977) for more information on probability sampling.

In equal probability sampling, each unit in the sampling frame, or in a stratum, hasthe same probability of being selected for the sample. PROC SURVEYSELECT

SAS OnlineDoc: Version 8

Page 26: The SURVEYSELECT Procedure

3298 � Chapter 63. The SURVEYSELECT Procedure

provides the following methods that select units with equal probability: simple ran-dom sampling, unrestricted random sampling, systematic random sampling, and se-quential random sampling. In simple random sampling, units are selectedwithoutreplacement, which means that a unit cannot be selected more than once. Both sys-tematic and sequential equal probability sampling are also without replacement. Inunrestricted random sampling, units are selectedwith replacement, which means thata unit can be selected more than once. In with-replacement sampling, thenumber ofhits refers to the number of times a unit is selected.

In probability proportional to size (PPS) sampling, a unit’s selection probability isproportional to its size measure. PROC SURVEYSELECT provides the followingmethods that select units with probability proportional to size (PPS): PPS samplingwithout replacement, PPS sampling with replacement, PPS systematic sampling, PPSsequential sampling, Brewer’s method, Murthy’s method, and Sampford’s method.PPS sampling is often used in cluster sampling, where you select clusters (or groupsof sampling units) of varying size in the first stage of selection. For example, clustersmay be schools, hospitals, or geographical areas, and the final sampling units maybe students, patients, or citizens. Cluster sampling can provide efficiencies in frameconstruction and other survey operations. Refer to Kalton (1983), Kish (1965), andthe other references cited in the following sections for more information.

The following sections give detailed descriptions of the sample selection methodsavailable in PROC SURVEYSELECT. In these sections,nh denotes the sample size(the number of units in the sample) for stratumh, andNh denotes the populationsize (number of units in the population) for stratumh, for h = 1; 2; : : : ;H. Whenthe sample design is not stratified,n denotes the sample size, andN denotes thepopulation size. For PPS sampling,Mhi represents the size measure for uniti instratumh, Mh� is the total of all size measures for the population of stratumh, andZhi = Mhi=Mh is the relative size of uniti in stratumh.

Simple Random SamplingThe method of simple random sampling (METHOD=SRS) selects units with equalprobability and without replacement. Each possible sample ofn different units out ofN has the same probability of being selected. The selection probability for each unitequalsn=N . When you request stratified sampling with a STRATA statement, PROCSURVEYSELECT selects samples independently within strata. The selection proba-bility for a unit in stratumh equalsnh=Nh for stratified simple random sampling.

PROC SURVEYSELECT uses Floyd’s ordered hash table algorithm for simple ran-dom sampling. This algorithm is fast, efficient, and appropriate for large data sets.Refer to Bentley and Floyd (1987) and Bentley and Knuth (1986). For additionalinformation on simple random sampling algorithms, refer to McLeod and Bellhouse(1983) and Fann, Muller, and Rezucha (1962).

Unrestricted Random SamplingThe method of unrestricted random sampling (METHOD=URS) selects units withequal probability and with replacement. Because units are selected with replace-ment, a unit can be selected for the sample more than once. The expected number ofselections or hits for each unit equalsn=N when sampling without stratification. Forstratified sampling, the expected number of hits for a unit in stratumh equalsnh=Nh.

SAS OnlineDoc: Version 8

Page 27: The SURVEYSELECT Procedure

Sequential Random Sampling � 3299

Systematic Random SamplingThe method of systematic random sampling (METHOD=SYS) selects units at a fixedinterval throughout the sampling frame or stratum after a random start. PROC SUR-VEYSELECT chooses the first unit randomly from the entire stratum and then treatsthe stratum observations as a closed loop. This is done to obtain an unbiased varianceestimator, as suggested by Lahiri (Murthy 1967). If you specify the sample size (orthe stratum sample sizes) with the SAMPSIZE= option, PROC SURVEYSELECTuses a fractional interval to provide exactly the specified sample size. The intervalequalsN=n, orNh=nh for stratified sampling. The selection probability for each unitequalsn=N , or nh=Nh for stratified sampling. If you specify the sampling rate (orthe stratum sampling rates) with the SAMPRATE= option, PROC SURVEYSELECTuses the inverse of the rate as the interval for systematic selection. The selection prob-ability for each unit equals the specified rate.

Systematic random sampling controls the distribution of the sample by spreading itthroughout the sampling frame or stratum at equal intervals, thus providing implicitstratification. You can use the CONTROL statement to order the input data set by theCONTROL variables before sample selection. If you also use a STRATA statement,PROC SURVEYSELECT sorts by the CONTROL variables within strata. If youdo not specify a CONTROL statement, PROC SURVEYSELECT applies systematicselection to the observations in the order in which they appear in the input data set.

Sequential Random SamplingIf you specify the option METHOD=SEQ and do not include a SIZE statement,PROC SURVEYSELECT uses the equal probability version of Chromy’s method forsequential random sampling. This method selects units sequentially with equal prob-ability and without replacement. Refer to Chromy (1979) and Williams and Chromy(1980). See the section “PPS Sequential Sampling” on page 3303 for a descriptionof Chromy’s PPS selection method.

Sequential random sampling controls the distribution of the sample by spreading itthroughout the sampling frame or stratum, thus providing implicit stratification ac-cording to the order of units in the frame or stratum. You can use the CONTROLstatement to sort the input data set by the CONTROL variables before sample selec-tion. If you also use a STRATA statement, PROC SURVEYSELECT sorts by theCONTROL variables within strata. By default, the procedure uses hierarchic serpen-tine ordering for sorting. If you specify the SORT=NEST option, the procedure usesnested sorting. See the section “Sorting by CONTROL Variables” on page 3296 fordescriptions of serpentine and nested sorting. If you do not specify a CONTROLstatement, PROC SURVEYSELECT applies sequential selection to the observationsin the order in which they appear in the input data set.

SAS OnlineDoc: Version 8

Page 28: The SURVEYSELECT Procedure

3300 � Chapter 63. The SURVEYSELECT Procedure

Following Chromy’s method of sequential selection, PROC SURVEYSELECT ran-domly chooses a starting unit from the entire stratum (or frame, if the design is notstratified). Using this unit as the first one, the procedure treats the stratum units as aclosed loop. This is done so that all pairwise (joint) selection probabilities are pos-itive and an unbiased variance estimator can be obtained. The procedure numbersunits sequentially from the random start to the end of the stratum and then continuesfrom the beginning of the stratum until all units are numbered.

Beginning with the randomly chosen starting unit, PROC SURVEYSELECT accu-mulates the expected number of selections or hits, where the expected number ofselectionsEShi equalsnh=Nh for all unitsi in stratumh. The procedure computes

Ihi = Int (

iXj=1

EShj) = Int (i nh=Nh)

Fhi = Frac (

iXj=1

EShj) = Frac (i nh=Nh)

whereInt denotes the integer part of the number, andFrac denotes the fractionalpart.

Considering each unit sequentially, Chromy’s method determines whether uniti isselected by comparing the total number of selections for the firsti� 1 units,

Th(i�1) =

i�1Xj=1

Shj

with the value ofIh(i�1) .

If Th(i�1) = Ih(i�1) , Chromy’s method determines whether or not uniti is selectedas follows. IfFhi = 0 or Fh(i�1) > Fhi , then uniti is selected with certainty.Otherwise, uniti is selected with probability

(Fhi � Fh(i�1)) = (1� Fh(i�1))

If Th(i�1) = Ih(i�1) + 1 , Chromy’s method determines whether or not uniti isselected as follows. IfFhi = 0 or Fhi > Fh(i�1) , then the unit is not selected.Otherwise, uniti is selected with probability

Fhi = Fh(i�1)

PPS Sampling without ReplacementIf you specify the option METHOD=PPS, PROC SURVEYSELECT selects unitswith probability proportional to size and without replacement. The selection proba-bility for unit i in stratumh equalsnhZhi . The procedure uses the Hanurav-Vijayanalgorithm for PPS selection without replacement. Hanurav (1967) introduced this al-gorithm for the selection of two units per stratum, and Vijayan (1968) generalized it

SAS OnlineDoc: Version 8

Page 29: The SURVEYSELECT Procedure

PPS Sampling without Replacement � 3301

for the selection of more than two units. The algorithm enables computation of jointselection probabilities and provides joint selection probability values that usually en-sure nonnegativiy and stability of the Sen-Yates-Grundy variance estimator. Refer toFox (1989), Golmant (1990), and Watts (1991) for details.

Notation in the remainder of this section drops the stratum subscripth for simplicity,but selection is still done independently within strata if you specify a stratified design.For a stratified design,n now denotes the sample size for the current stratum,Ndenotes the stratum population size, andMi denotes the size measure for uniti inthe stratum. If the design is not stratified, this notation applies to the entire samplingframe.

According to the Hanurav-Vijayan algorithm, PROC SURVEYSELECT first ordersunits within the stratum in ascending order by size measure, so thatM1 � M2 �

: : : �MN . Then the procedure selects the PPS sample ofn observations as follows:

1. The procedure randomly chooses one of the integers1; 2; : : : ; n with probabil-ity �1; �2; : : : ; �n, where

�i = n (ZN�n+i+1 � ZN�n+i) (T + i ZN�n+1) = T

Zj = Mj=M , T =PN�n

j=1 Zj, and, by definition,ZN+1 = 1=n to ensurethat

Pni=1 �i = 1 .

2. If i is the integer selected in step 1, the procedure includes the last(n� i) unitsof the stratum in the sample, where the units are ordered by size measure as de-scribed previously. The procedure then selects the remainingi units accordingto steps 3 through 6 below.

3. The procedure defines new normed size measures for the remaining(N�n+i)stratum units that were not selected in steps 1 and 2,

Z�

j = Zj = (T + i ZN�n+1) for j = 1; : : : ; N � n+ 1

Z�

j = ZN�n+1 = (T + i ZN�n+1) for j = N � n+ 2; : : : ; N � n+ i

4. The procedure selects the next unit from the first(N � n + 1) stratum unitswith probability proportional toaj(1), where

a1(1) = i Z�

1

aj(1) = i Z�

j

Qj�1k=1[1� (i� 1) Pk] for j = 2; : : : ; N � n+ 1

andPk = Mk=(Mk+1 +Mk+2 + � � � +MN�n+i) .

5. If stratum unitj1 is the unit selected in step 4, then the procedure selects thenext unit from unitsj1 +1 throughN �n+2 with probability proportional toaj(2; j1), where

aj1+1(2; j1) = (i� 1) Z�

j1+1

aj(2; j1) = (i� 1) Z�

j

j�1Yk=j1+1

[1� (i� 2) Pk] for j = j1 + 2; : : : ; N � n+ 2

SAS OnlineDoc: Version 8

Page 30: The SURVEYSELECT Procedure

3302 � Chapter 63. The SURVEYSELECT Procedure

6. The procedure repeats step 5 until alln sample units are selected.

If you request the JTPROBS option, PROC SURVEYSELECT computes the joint se-lection probabilities for all pairs of selected units in each stratum. The joint selectionprobability for unitsi andj in the stratum equals

P(ij) =nX

r=1

�rK(r)ij

where

K(r)ij = 1 N � n+ r < i � N � 1

= r ZN�n+1 = (T + r ZN�n+1) N � n < i � N � n+ r; j > N � n+ r

= r Zi = (T + r ZN�n+1) 1 � i � N � n; j > N � n+ r

= �(r)ij j � N � n+ r

and

�(r)ij =

r(r � 1)

2Pi Zj

i�1Yk=1

(1� Pk)

wherePk = Mk=(Mk+1 +Mk+2 + � � �+MN�n+r) .

PPS Sampling with ReplacementIf you specify the option METHOD=PPS–WR, PROC SURVEYSELECT selectsunits with probability proportional to size and with replacement. The proceduremakesnh independent random selections from the stratum ofNh units, selectingwith probabilityZhi = Mhi=Mh� . Because units are selected with replacement, aunit can be selected for the sample more than once. The expected number of selec-tions or hits for uniti in stratumh equalsnhZhi . If you request the JTPROBS option,PROC SURVEYSELECT computes the joint expected number of hits for all pairs ofselected units in each stratum. The joint expected number of hits for unitsi andj instratumh equals

Ph(ij) =nh(nh � 1)

2Zhi Zhj

PPS Systematic SamplingIf you specify the option METHOD=PPS–SYS, PROC SURVEYSELECT selectsunits by systematic random sampling with probability proportional to size. System-atic sampling selects units at a fixed interval throughout the stratum or sampling frameafter a random start. PROC SURVEYSELECT chooses the first unit randomly fromthe entire stratum with probability proportional to size and then treats the stratumobservations as a closed loop. This is done to obtain an unbiased variance estimator,as suggested by Lahiri (Murthy 1967). PROC SURVEYSELECT uses a fractional

SAS OnlineDoc: Version 8

Page 31: The SURVEYSELECT Procedure

PPS Sequential Sampling � 3303

interval to provide exactly the specified sample size. The interval equalsMh�=nh forstratified sampling andM=n for sampling without stratification. Depending on thesample size and the values of the size measures, it may be possible for a unit to be se-lected more than once. The expected number of selections or hits for uniti in stratumh equalsnhMhi=Mh� = nhZhi . Refer to Cochran (1977, pp. 265–266) and Madow(1949).

Systematic random sampling controls the distribution of the sample by spreading itthroughout the sampling frame or stratum at equal intervals, thus providing implicitstratification. You can use the CONTROL statement to order the input data set by theCONTROL variables before sample selection. If you also use a STRATA statement,PROC SURVEYSELECT sorts by the CONTROL variables within strata. If youdo not specify a CONTROL statement, PROC SURVEYSELECT applies systematicselection to the observations in the order in which they appear in the input data set.

PPS Sequential SamplingIf you specify the option METHOD=PPS–SEQ, PROC SURVEYSELECT usesChromy’s method of sequential random sampling. Refer to Chromy (1979) andWilliams and Chromy (1980). Chromy’s method selects units sequentially with prob-ability proportional to size and with minimum replacement. Selectionwith minimumreplacementmeans that the actual number of hits for a unit can equal the integer partof the expected number of hits for that unit, or the next largest integer. This canbe compared to selectionwithout replacement, where each unit can be selected onlyonce, so the number of hits can equal 0 or one. The other alternative is selectionwithreplacement, where there is no restriction on the number of hits for each unit, so thenumber of hits can equal0; 1; � � � ; nh, wherenh is the stratum sample size.

Sequential random sampling controls the distribution of the sample by spreading itthroughout the sampling frame or stratum, thus providing implicit stratification ac-cording to the order of units in the frame or stratum. You can use the CONTROLstatement to sort the input data set by the CONTROL variables before sample selec-tion. If you also use a STRATA statement, PROC SURVEYSELECT sorts by theCONTROL variables within strata. By default, the procedure uses hierarchic serpen-tine ordering to sort the sampling frame by the CONTROL variables within strata. Ifyou specify the SORT=NEST option, the procedure uses nested sorting. See the sec-tion “Sorting by CONTROL Variables” on page 3296 for descriptions of serpentineand nested sorting. If you do not specify a CONTROL statement, PROC SURVEY-SELECT applies sequential selection to the observations in the order in which theyappear in the input data set.

According to Chromy’s method of sequential selection, PROC SURVEYSELECTfirst chooses a starting unit randomly from the entire stratum, with probability pro-portional to size. The procedure uses this unit as the first one and treats the stratumobservations as a closed loop. This is done so that all pairwise (joint) expected num-ber of hits are positive and an unbiased variance estimator can be obtained. Theprocedure numbers observations sequentially from the random start to the end of thestratum and then continues from the beginning of the stratum until all units are num-bered.

SAS OnlineDoc: Version 8

Page 32: The SURVEYSELECT Procedure

3304 � Chapter 63. The SURVEYSELECT Procedure

Beginning with the randomly chosen starting unit, Chromy’s method partitions theordered stratum sampling frame intonh zones of equal size. There is one selectionfrom each zone and a total ofnh selections or hits, although fewer thannh distinctunits may be selected. Beginning with the random start, the procedure accumulatesthe expected number of hits and computes

EShi = nhZhi

Ihi = Int (

iXj=1

EShj)

Fhi = Frac (

iXj=1

EShj)

whereEShi represents the expected number of hits for uniti in stratumh; Int de-notes the integer part of the number; andFrac denotes the fractional part.

Considering each unit sequentially, Chromy’s method determines the actual numberof hits for uniti by comparing the total number of hits for the firsti� 1 units,

Th(i�1) =i�1Xj=1

Shj

with the value ofIh(i�1) .

If Th(i�1) = Ih(i�1) , Chromy’s method determines the total number of hits for thefirst i units as follows. IfFhi = 0 or Fh(i�1) > Fhi , thenThi = Ihi. Otherwise,Thi = Ihi + 1 with probability

(Fhi � Fh(i�1)) = (1� Fh(i�1))

And the number of hits for uniti equalsThi � Th(i�1).

If Th(i�1) = Ih(i�1) + 1 , Chromy’s method determines the total number of hits forthe first i units as follows. IfFhi = 0, thenThi = Ihi. If Fhi > Fh(i�1) , thenThi = Ihi + 1. Otherwise,Thi = Ihi + 1 with probability

Fhi = Fh(i�1)

Brewer’s PPS MethodBrewer’s method (METHOD=PPS–BREWER) selects two units from each stratum,with probability proportional to size and without replacement. The selection proba-bility for unit i in stratumh equals2Mhi=Mh� = 2Zhi .

SAS OnlineDoc: Version 8

Page 33: The SURVEYSELECT Procedure

Murthy’s PPS Method � 3305

Brewer’s algorithm first selects a unit with probability

Zhi (1� Zhi)

Dh (1� 2Zhi)

where

Dh =

NhXi=1

Zhi (1� Zhi)

1� 2Zhi

Then a second unit is selected from the remaining units with probability

Zhj

1� Zhi

where uniti is the first unit selected. The joint selection probability for unitsi andjin stratumh equals

Ph(ij) =2 Zhi Zhj

Dh

�1� Zhi � Zhj

(1� 2Zhi) (1� 2Zhj)

Brewer’s method requires that the relative sizeZhi be less than 0.5 for all units.Refer to Cochran (1977, pp. 261–263) and Brewer (1963). Brewer’s method yieldsthe same selection probabilities and joint selection probabilities as Durbin’s method.Refer to Cochran (1977) and Durbin (1967).

Murthy’s PPS MethodMurthy’s method (METHOD=PPS–MURTHY) selects two units from each stratum,with probability proportional to size and without replacement. The selection proba-bility for unit i in stratumh equals

Phi = Zhi [1 +K � (Zhi=(1 � Zhi)]

whereZhi = Mhi=Mh� and

K =

NXj=1

[Zhj=(1� Zhj)]

Murthy’s algorithm first selects a unit with probabilityZhi . Then a second unit isselected from the remaining units with probabilityZhj=(1�Zhi), where uniti is thefirst unit selected. The joint selection probability for unitsi andj in stratumh equals

Ph(ij) = Zhi Zhj

2� Zhi � Zhj

(1� Zhi) (1� Zhj)

Murthy’s method requires that the relative sizeZhi be less than 0.5 for all units. Referto Cochran (1977, pp. 263–265) and Murthy (1957).

SAS OnlineDoc: Version 8

Page 34: The SURVEYSELECT Procedure

3306 � Chapter 63. The SURVEYSELECT Procedure

Sampford’s PPS MethodSampford’s method (METHOD=PPS–SAMPFORD) is an extension of Brewer’smethod that selects more than two units from each stratum, with probability propor-tional to size and without replacement. The selection probability for uniti in stratumh equals

Phi = nhMhi

Mh�

= nh Zhi

Sampford’s method first selects a unit from stratumh with probabilityZhi . Thensubsequent units are selected with probability proportional to

Zhi

1� nh Zhi

and with replacement. If the same unit appears more than once in the sample of sizenh, then Sampford’s algorithm rejects that sample and selects a new sample. Thesample is accepted if it containsnh distinct units.

The joint selection probability for unitsi andj in stratumh equals

Ph(ij) = Kh �i �j

nhXt=2

�t� nh (Phi + Phj) Lnh�t(ij)

�= nt�2h

where

�i =Zhi

1� nh Zhi

Lm =XS(m)

�i1�i2 � � � �im

whereS(m) denotes all possible samples of sizem, for m = 1; 2; : : : ; Nh . The sumLm(ij) is defined similarly toLm but sums over all possible samples of sizem thatdo not include unitsi andj, and

Kh =

nhXt=1

t Lnh�t = nth

!�1

Sampford’s method requires that the relative sizeZhi be less than1=nh for all units.Refer to Cochran (1977, pp. 262–263) and Sampford (1967).

Output Data Set

PROC SURVEYSELECT creates a SAS data set that contains the sample of selectedunits. You can specify the name of this output data set with the OUT= option in thePROC SURVEYSELECT statement. If you omit the OUT= option, the data set isnamed DATAn, wheren is the smallest integer that makes the name unique.

SAS OnlineDoc: Version 8

Page 35: The SURVEYSELECT Procedure

Output Data Set � 3307

The output data set contains an observation for each unit selected for the sample. Ifyou specify the OUTHITS option for methods that may select the same unit morethan once (that is, methods that select with replacement or with minimum replace-ment), the output data set contains a separate observation for each selection. If you donot specify the OUTHITS option, the output data set contains only one observationfor each selected unit, even if the unit is selected more than once, and the variableNumberHits contains the number of hits or selections for that unit.

The output data set contains design information and selection statistics, depending onthe selection method and output options you specify. The output data set can includethe following variables:

� STRATA variables

� Replicate, which is the sample replicate number. This variable is includedwhen you request replicated sampling with the REP= option.

� ID variables

� CONTROL variables

� Zone, which is the selection zone. This variable is included forMETHOD=PPS–SEQ.

� SIZE variable

� AdjustedSize, which is the adjusted size measure. This variable is includedif you request adjusted sizes with the MINSIZE= option or the MAXSIZE=option.

� Certain, which indicates certainty selection. This variable is included if youspecify the CERTSIZE= option. It equals 1 for units included with certaintybecause their size measures exceed the certainty size measure. Otherwise, itequals 0.

� NumberHits, which is the number of hits or selections. This variable isincluded for selection methods that are with replacement or with minimumreplacement (METHOD=URS, METHOD=PPS–WR, METHOD=PPS–SYS,and METHOD=PPS–SEQ).

The output data set includes the following variables if you request a PPS selectionmethod or if you specify the STATS option for other methods:

� ExpectedHits, which is the expected number of hits or selections. Thisvariable is included for selection methods that are with replacementor with minimum replacement (METHOD=URS, METHOD=PPS–WR,METHOD=PPS–SYS, and METHOD=PPS–SEQ).

� SelectionProb, which is the probability of selection. This variable is includedfor selection methods that are without replacement.

� SamplingWeight, which is the sampling weight. This variable equals the in-verse ofExpectedHits or SelectionProb.

SAS OnlineDoc: Version 8

Page 36: The SURVEYSELECT Procedure

3308 � Chapter 63. The SURVEYSELECT Procedure

For METHOD=PPS–BREWER and METHOD=PPS–MURTHY, which select twounits from each stratum with probability proportional to size, the output data set con-tains the following variable:

� JtSelectionProb, which is the joint probability of selection for the two unitsselected from the stratum

If you request the JTPROBS option to compute joint probabilities of selection forMETHOD=PPS or METHOD=PPS–SAMPFORD, then the output data set containsthe following variables:

� Unit, which is an identification variable that numbers the selected units sequen-tially within each stratum

� JtProb–1, JtProb–2, JtProb–3, . . . , where the variableJtProb–1 containsthe joint probability of selection for the current unit and unit 1. Similarly,JtProb–2 contains the joint probability of selection for the current unit andunit 2, and so on.

If you request the JTPROBS option for METHOD=PPS–WR, then the output dataset contains the following variables:

� Unit, which is an identification variable that numbers the selected units sequen-tially within each stratum

� JtHits–1, JtHits–2, JtHits–3, . . . , where the variableJtHits–1 contains thejoint expected number of hits for the current unit and unit 1. Similarly,JtHits–2 contains the joint expected number of hits for the current unit andunit 2, and so on.

If you request the OUTSIZE option, the output data set contains the following vari-ables. If you specify a STRATA statement, the output data set includes stratum-levelvalues of these variables. Otherwise, the output data set contains population-levelvalues of these variables.

� MinimumSize, which is the minimum size measure specified with the MIN-SIZE= option. This variable is included if you request the MINSIZE= option.

� MaximumSize, which is the maximum size measure specified with the MAX-SIZE= option. This variable is included if you request the MAXSIZE= option.

� CertaintySize, which is the certainty size measure specified with the CERT-SIZE= option. This variable is included if you request the CERTSIZE= option.

� Total, which is the total number of sampling units in the stratum. This variableis included if there is no SIZE statement.

� TotalSize, which is the total of size measures in the stratum. This variable isincluded if there is a SIZE statement.

SAS OnlineDoc: Version 8

Page 37: The SURVEYSELECT Procedure

Displayed Output � 3309

� TotalAdjSize, which is the total of adjusted size measures in the stratum. Thisvariable is included if there is a SIZE statement and if you request adjustedsizes with the MAXSIZE= option or the MINSIZE= option.

� SamplingRate, which is the sampling rate. This variable is included if youspecify the SAMPRATE= option.

� SampleSize, which is the sample size. This variable is included if youspecify the SAMPSIZE= option, or if you specify METHOD=BREWER orMETHOD=MURTHY, which select two units from each stratum.

Displayed Output

By default, PROC SURVEYSELECT displays two tables that summarize the sampleselection. You can suppress display of these tables by using the NOPRINT option.

PROC SURVEYSELECT creates an output data set that contains the units selectedfor the sample. The procedure does not display this output data set. Use PROCPRINT, PROC REPORT, or any other SAS reporting tool to display the output dataset.

PROC SURVEYSELECT displays the following information in the “Sample Selec-tion Method” table:

� Selection Method

� Size Measure variable, if you specify a SIZE statement

� Minimum Size Measure, if you specify the MINSIZE=minoption

� Maximum Size Measure, if you specify the MAXSIZE=maxoption

� Certainty Size Measure, if you specify the CERTSIZE=certainoption

� Strata Variables, if you specify a STRATA statement

� Control Variables, if you specify a CONTROL statement

� type of Control Sorting, Serpentine or Nested, if you specify a CONTROLstatement

PROC SURVEYSELECT displays the following information in the “Sample Selec-tion Summary” table:

� Input Data Set name

� Sorted Data Set name, if you specify the OUTSORT= option

� Random Number Seed

� Sample Size or Stratum Sample Size, if you specify the SAMPSIZE=n option

� Sample Size Data Set, if you specify the SAMPSIZE=SAS-data-setoption

� Sampling Rate or Stratum Sampling Rate, if you specify the SAMPRATE=roption

� Sampling Rate Data Set, if you specify the SAMPRATE=SAS-data-setoption

SAS OnlineDoc: Version 8

Page 38: The SURVEYSELECT Procedure

3310 � Chapter 63. The SURVEYSELECT Procedure

� Minimum Sample Size or Stratum Minimum Sample Size, if you specify theNMIN= option with the SAMPRATE= option

� Maximum Sample Size or Stratum Maximum Sample Size, if you specify theNMAX= option with the SAMPRATE= option

� Selection Probability, if you specify METHOD=SRS, METHOD=SYS, orMETHOD=SEQ and do not specify a STRATA statement

� Expected Number of Hits, if you specify METHOD=URS and do not specifya STRATA statement

� Sampling Weight for equal probability selection methods (METHOD=SRS,METHOD=URS, METHOD=SYS, METHOD=SEQ) if you do not specify aSTRATA statement

� Number of Strata, if you specify a STRATA statement

� Number of Replicates, if you specify the REP= option

� Total Sample Size, if you specify a STRATA statement or the REP= option

� Output Data Set name

ODS Table Names

PROC SURVEYSELECT assigns a name to each table it creates. You can use thesenames to reference the table when using the Output Delivery System (ODS) to selecttables and create output data sets. These names are listed in the following table. Formore information on ODS, see Chapter 15, “Using the Output Delivery System.”

Table 63.2. ODS Tables Produced in PROC SURVEYSELECT

ODS Table Name Description Statement OptionMethod Sample selection method PROC defaultSummary Sample selection summary PROC default

Examples

Example 63.1. Replicated Sampling

This example uses theCustomers data set from the section “Getting Started” onpage 3276. The data setCustomers contains an Internet service provider’s currentsubscribers, and the service provider wants to select a sample from this populationfor a customer satisfaction survey.

This example illustrates replicated sampling, which selects multiple samples from thesurvey population according to the same design. You can use replicated sampling toprovide a simple method of variance estimation, or to evaluate variable nonsamplingerrors such as interviewer differences. Refer to Kish (1965), Kish (1987), and Kalton(1983) for information on replicated sampling.

This design includes 4 replicates, each with a sample size of 50 customers. Thesampling frame is stratified byState and sorted byType andUsage within strata.

SAS OnlineDoc: Version 8

Page 39: The SURVEYSELECT Procedure

Example 63.1. Replicated Sampling � 3311

Customers are selected by sequential random sampling with equal probability withinstrata. The following PROC SURVEYSELECT statements select a probability sam-ple of customers from theCustomers data set using this design.

title1 ’Customer Satisfaction Survey’;title2 ’Replicated Sampling’;proc surveyselect data=Customers method=seq

rep=4 n=(8 12 20 10)seed=40070 out=SampleRep;

strata State;control Type Usage;

run;

The STRATA statement names the stratification variableState. The CON-TROL statement names the control variablesType and Usage. In the PROCSURVEYSELECT statement, the METHOD=SEQ option requests sequential ran-dom sampling. The REP=4 option specifies 4 replicates of this sample. The N=(8 1220 10) option specifies the stratum sample sizes for each replicate. The N= optionlists the stratum sample sizes in the same order as the strata appear in theCustomersdata set, which has been sorted byState. The sample size of 8 customers corre-sponds to the first stratum,State = ‘AL’. The sample size 12 corresponds to the nextstratum,State = ‘FL’, and so on. The SEED=40070 option specifies ’40070’ as theinitial seed for random number generation.

Figure 63.1.1 displays the output from PROC SURVEYSELECT, which summarizesthe sample selection. A total of 200 customers is selected in 4 replicates. PROCSURVEYSELECT selects each replicate using sequential random sampling withinstrata determined byState. The sampling frameCustomers is sorted by controlvariablesType andUsage within strata, according to hierarchic serpentine sorting.The output data setSampleRep contains the sample.

Output 63.1.1. Sample Selection Summary

Customer Satisfaction SurveyReplicated Sampling

The SURVEYSELECT Procedure

Selection Method Sequential Random SamplingWith Equal Probability

Strata Variable StateControl Variables Type

UsageControl Sorting Serpentine

Input Data Set CUSTOMERSRandom Number Seed 40070Number of Strata 4Number of Replicates 4Total Sample Size 200Output Data Set SAMPLEREP

SAS OnlineDoc: Version 8

Page 40: The SURVEYSELECT Procedure

3312 � Chapter 63. The SURVEYSELECT Procedure

The following PROC PRINT statements display the selected customers for the firststratum,State = ‘AL’, from the output data setSampleRep.

title1 ’Customer Satisfaction Survey’;title2 ’Sample Selected by Replicated Design’;title3 ’(First Stratum)’;proc print data=SampleRep;

where State = ’AL’;run;

Figure 63.1.2 displays the 32 sample customers of the first stratum (State = ‘AL’)from the output data setSampleRep, which includes the entire sample of 200 cus-tomers. The variableSelectionProb contains the selection probability, andSam-plingWeight contains the sampling weight. Since customers are selected with equalprobability within strata in this design, all customers in the same stratum have thesame selection probability. These selection probabilities and sampling weights applyto a single replicate, and the variableReplicate contains the sample replicate number.

Output 63.1.2. Customer Sample (First Stratum)

Customer Satisfaction SurveySample Selected by Replicated Design

(First Stratum)

Selection SamplingObs State Replicate CustomerID Type Usage Prob Weight

1 AL 1 882-37-7496 New 572 .004115226 2432 AL 1 581-32-5534 New 863 .004115226 2433 AL 1 980-29-2898 Old 571 .004115226 2434 AL 1 172-56-4743 Old 128 .004115226 2435 AL 1 998-55-5227 Old 35 .004115226 2436 AL 1 625-44-3396 New 60 .004115226 2437 AL 1 627-48-2509 New 114 .004115226 2438 AL 1 257-66-6558 New 172 .004115226 2439 AL 2 622-83-1680 New 22 .004115226 243

10 AL 2 343-57-1186 New 53 .004115226 24311 AL 2 976-05-3796 New 110 .004115226 24312 AL 2 859-74-0652 New 303 .004115226 24313 AL 2 476-48-1066 New 839 .004115226 24314 AL 2 109-27-8914 Old 2102 .004115226 24315 AL 2 743-25-0298 Old 376 .004115226 24316 AL 2 722-08-2215 Old 105 .004115226 24317 AL 3 668-57-7696 New 200 .004115226 24318 AL 3 300-72-0129 New 471 .004115226 24319 AL 3 073-60-0765 New 656 .004115226 24320 AL 3 526-87-0258 Old 672 .004115226 24321 AL 3 726-61-0387 Old 150 .004115226 24322 AL 3 632-29-9020 Old 51 .004115226 24323 AL 3 417-17-8378 New 56 .004115226 24324 AL 3 091-26-2366 New 93 .004115226 24325 AL 4 336-04-1288 New 419 .004115226 24326 AL 4 827-04-7407 New 650 .004115226 24327 AL 4 317-70-6496 Old 452 .004115226 24328 AL 4 002-38-4582 Old 206 .004115226 24329 AL 4 181-83-3990 Old 33 .004115226 24330 AL 4 675-34-7393 New 47 .004115226 24331 AL 4 228-07-6671 New 65 .004115226 24332 AL 4 298-46-2434 New 161 .004115226 243

SAS OnlineDoc: Version 8

Page 41: The SURVEYSELECT Procedure

Example 63.2. PPS Selection of Two Units Per Stratum � 3313

Example 63.2. PPS Selection of Two Units Per Stratum

A state health agency plans to conduct a state-wide survey of a variety of differenthospital services. The agency plans to select a probability sample of individual dis-charge records within hospitals using a two-stage sample design. First stage units arehospitals, and second stage units are patient discharges during the study time period.Hospitals are stratified first according to geographic region and then by rural/urbantype and size of hospital. Two hospitals are selected from each stratum with proba-bility proportional to size. This example describes hospital selection for this surveyusing PROC SURVEYSELECT.

The data setHospitalFrame contains all hospitals in the first geographical region ofthis state.

data HospitalFrame;input Hospital$ Type$ SizeMeasure @@;if (SizeMeasure < 20) then Size=’Small ’;

else if (SizeMeasure < 50) then Size=’Medium’;else Size = ’Large ’;

datalines;034 Rural 0.870 107 Rural 1.316079 Rural 2.127 223 Rural 3.960236 Rural 5.279 165 Rural 5.893086 Rural 0.501 141 Rural 11.528042 Urban 3.104 124 Urban 4.033006 Urban 4.249 261 Urban 4.376195 Urban 5.024 190 Urban 10.373038 Urban 17.125 083 Urban 40.382259 Urban 44.942 129 Urban 46.702133 Urban 46.992 218 Urban 48.231026 Urban 61.460 058 Urban 65.931119 Urban 66.352;

In the SAS data setHospitalFrame, the variableHospital identifies the hospital.The variableType equals ‘Urban’ if the hospital is located in an urbanized area,and ‘Rural’ otherwise. The variableSizeMeasure contains the hospital’s size mea-sure, which is constructed from past data on service utilization for the hospital to-gether with the desired sampling rates for each service. This size measure reflects theamount of relevant survey information expected from the hospital. Refer to Drum-mond et al. (1982) for details on this type of size measure. The variableSize equals‘Small’, ‘Medium’, or ‘Large’, depending on the value of the hospital’s size measure.

The following PROC PRINT statements display the data setHospital Frame.

title1 ’Hospital Utilization Survey’;title2 ’Sampling Frame, Region 1’;proc print data=HospitalFrame;run;

SAS OnlineDoc: Version 8

Page 42: The SURVEYSELECT Procedure

3314 � Chapter 63. The SURVEYSELECT Procedure

Output 63.2.1. Sampling Frame

Hospital Utilization SurveySampling Frame, Region 1

SizeObs Hospital Type Measure Size

1 034 Rural 0.870 Small2 107 Rural 1.316 Small3 079 Rural 2.127 Small4 223 Rural 3.960 Small5 236 Rural 5.279 Small6 165 Rural 5.893 Small7 086 Rural 0.501 Small8 141 Rural 11.528 Small9 042 Urban 3.104 Small

10 124 Urban 4.033 Small11 006 Urban 4.249 Small12 261 Urban 4.376 Small13 195 Urban 5.024 Small14 190 Urban 10.373 Small15 038 Urban 17.125 Small16 083 Urban 40.382 Medium17 259 Urban 44.942 Medium18 129 Urban 46.702 Medium19 133 Urban 46.992 Medium20 218 Urban 48.231 Medium21 026 Urban 61.460 Large22 058 Urban 65.931 Large23 119 Urban 66.352 Large

The following PROC SURVEYSELECT statements select a probability sample ofhospitals from theHospitalFrame data set, using a stratified design with PPS selec-tion of two units from each stratum.

title1 ’Hospital Utilization Survey’;proc surveyselect data=HospitalFrame method=pps_brewer

seed=48702 out=SampleHospitals;size SizeMeasure;strata Type Size notsorted;run;

The STRATA statement names the stratification variablesType andSize. The NOT-SORTED option specifies that observations with the same STRATA variable valuesare grouped together but are not necessarily sorted in alphabetical or increasing nu-merical order. In theHospitalFrame data set,Size = ‘Small’ precedesSize =‘Medium’.

In the PROC SURVEYSELECT statement, the METHOD=PPS–BREWER optionrequests sample selection by Brewer’s method, which selects two units per stratumwith probability proportional to size. The SEED=48702 option specifies ’48702’ asthe initial seed for random number generation. The SIZE statement specifies the sizemeasure variable. It is not necessary to specify the sample size with the N= option,since Brewer’s method always selects two units from each stratum.

Figure 63.2.2 displays the output from PROC SURVEYSELECT. A total of 8 hos-pitals were selected from the 4 strata. The data setSampleHospitals contains theselected hospitals.

SAS OnlineDoc: Version 8

Page 43: The SURVEYSELECT Procedure

Example 63.3. PPS (Dollar-Unit) Sampling � 3315

Output 63.2.2. Sample Selection Summary

Hospital Utilization Survey

The SURVEYSELECT Procedure

Selection Method Brewer’s PPS MethodSize Measure SizeMeasureStrata Variables Type

Size

Input Data Set HOSPITALFRAMERandom Number Seed 48702Stratum Sample Size 2Number of Strata 4Total Sample Size 8Output Data Set SAMPLEHOSPITALS

The following PROC PRINT statements display the sample hospitals.

title1 ’Hospital Utilization Survey’;title2 ’Sample Hospitals, Region 1’;proc print data=SampleHospitals;run;

Output 63.2.3. Sample Hospitals

Hospital Utilization SurveySample Selected by Stratified PPS Design

JtSize Selection Sampling Selection

Obs Type Size Hospital Measure Prob Weight Prob

1 Rural Small 079 2.127 0.13516 7.39868 0.018512 Rural Small 236 5.279 0.33545 2.98106 0.018513 Urban Small 006 4.249 0.17600 5.68181 0.014544 Urban Small 195 5.024 0.20810 4.80533 0.014545 Urban Medium 133 46.992 0.41357 2.41795 0.113056 Urban Medium 218 48.231 0.42448 2.35584 0.113057 Urban Large 026 61.460 0.63445 1.57617 0.315058 Urban Large 058 65.931 0.68060 1.46929 0.31505

The variableSelectionProb contains the selection probability for each hospital in thesample. The variableJtSelectionProb contains the joint probability of selection forthe two sample hospitals in the same stratum. The variableSamplingWeight con-tains the sampling weight component for this first stage of the design. The final-stageweight components, which correspond to patient record selection within hospitals,can be multiplied by the hospital weight components to obtain the overall samplingweights.

Example 63.3. PPS (Dollar-Unit) Sampling

A small company wants to audit employee travel expenses in an effort to improve theexpense reporting procedure and possibly reduce expenses. The company does nothave resources to examine all expense reports and wants to use statistical sampling toobjectively select expense reports for audit.

SAS OnlineDoc: Version 8

Page 44: The SURVEYSELECT Procedure

3316 � Chapter 63. The SURVEYSELECT Procedure

The data setTravelExpense contains the dollar amount of all employee travel ex-pense transactions during the past month.

data TravelExpense;input ID$ Amount @@;if (Amount < 500) then Level=’1_Low ’;

else if (Amount > 1500) then Level=’3_High’;else Level=’2_Avg ’;

datalines;110 237.18 002 567.89 234 118.50743 74.38 411 1287.23 782 258.10216 325.36 174 218.38 568 1670.80302 134.71 285 2020.70 314 47.80139 1183.45 775 330.54 425 780.10506 895.80 239 620.10 011 420.18672 979.66 142 810.25 738 670.85192 314.58 243 87.50 263 1893.40496 753.30 332 540.65 486 2580.35614 230.56 654 185.60 308 688.43784 505.14 017 205.48 162 650.42289 1348.34 691 30.50 545 2214.80517 940.35 382 217.85 024 142.90478 806.90 107 560.72;

In the SAS data setTravelExpense, the variableID identifies the travel expensereport. The variableAmount contains the dollar amount of the reported expense.The variableLevel equals ‘1–Low’, ‘2–Avg’, or ‘3–High’, depending on the valueof Amount.

In the sample design for this audit, expense reports are stratified byLevel. Thisensures that each of these expense levels is included in the sample and also permitsa disproportionate allocation of the sample, selecting proportionately more of theexpense reports from the higher levels. Within strata, the sample of expense reportsis selected with probability proportional to the amount of the expense, thus givinga greater chance of selection to larger expenses. In auditing terms, this is known asmonetary-unit sampling. Refer to Wilburn (1984).

PROC SURVEYSELECT requires that the input data set be sorted by the STRATAvariables. The following PROC SORT statements sort theTravelExpense data setby the stratification variableLevel.

proc sort data=TravelExpense;by Level;

run;

The following PROC PRINT statements display the sampling frame data setTravel-Expense, which contains 41 observations.

title1 ’Travel Expense Audit’;proc print data=TravelExpense;run;

SAS OnlineDoc: Version 8

Page 45: The SURVEYSELECT Procedure

Example 63.3. PPS (Dollar-Unit) Sampling � 3317

Output 63.3.1. Sampling Frame

Travel Expense Audit

Obs ID Amount Level

1 110 237.18 1_Low2 234 118.50 1_Low3 743 74.38 1_Low4 782 258.10 1_Low5 216 325.36 1_Low6 174 218.38 1_Low7 302 134.71 1_Low8 314 47.80 1_Low9 775 330.54 1_Low

10 011 420.18 1_Low11 192 314.58 1_Low12 243 87.50 1_Low13 614 230.56 1_Low14 654 185.60 1_Low15 017 205.48 1_Low16 691 30.50 1_Low17 382 217.85 1_Low18 024 142.90 1_Low19 002 567.89 2_Avg20 411 1287.23 2_Avg21 139 1183.45 2_Avg22 425 780.10 2_Avg23 506 895.80 2_Avg24 239 620.10 2_Avg25 672 979.66 2_Avg26 142 810.25 2_Avg27 738 670.85 2_Avg28 496 753.30 2_Avg29 332 540.65 2_Avg30 308 688.43 2_Avg31 784 505.14 2_Avg32 162 650.42 2_Avg33 289 1348.34 2_Avg34 517 940.35 2_Avg35 478 806.90 2_Avg36 107 560.72 2_Avg37 568 1670.80 3_High38 285 2020.70 3_High39 263 1893.40 3_High40 486 2580.35 3_High41 545 2214.80 3_High

The following PROC SURVEYSELECT statements select a probability sample ofexpense reports from theTravelExpense data set using the stratified design withPPS selection within strata.

title1 ’Travel Expense Audit’;proc surveyselect data=TravelExpense method=pps

n=(6 10 4)seed=47279 out=AuditSample;

size Amount;strata Level;run;

The STRATA statement names the stratification variableLevel. The SIZE state-ment specifies the size measure variableAmount. In the PROC SURVEYSELECT

SAS OnlineDoc: Version 8

Page 46: The SURVEYSELECT Procedure

3318 � Chapter 63. The SURVEYSELECT Procedure

statement, the METHOD=PPS option requests sample selection with probability pro-portional to size and without replacement. The N=(6 10 4) option specifies the stra-tum sample sizes, listing the sample sizes in the same order that the strata apear inthe TravelExpense data set. The sample size of 6 corresponds to the first stratum,Level = ‘1–Low’, the sample size of 10 corresponds to the second stratum,Level =‘2–Avg’, and 4 corresponds to the last stratum,Level = ‘3–High’. The SEED=47279option specifies ’47279’ as the initial seed for random number generation.

Figure 63.3.2 displays the output from PROC SURVEYSELECT. A total of 20 ex-pense reports is selected for audit. The data setAuditSample contains the sample oftravel expense reports.

Output 63.3.2. Sample Selection Summary

Travel Expense Audit

The SURVEYSELECT Procedure

Selection Method PPS, Without ReplacementSize Measure AmountStrata Variable Level

Input Data Set TRAVELEXPENSERandom Number Seed 47279Number of Strata 3Total Sample Size 20Output Data Set AUDITSAMPLE

The following PROC PRINT statements display the audit sample.

title1 ’Travel Expense Audit Sample’;title2 ’Sample Selected by Stratified PPS Design’;proc print data=AuditSample;run;

SAS OnlineDoc: Version 8

Page 47: The SURVEYSELECT Procedure

References � 3319

Output 63.3.3. Audit Sample

Travel Expense AuditSample Selected by Stratified PPS Design

Selection SamplingObs Level ID Amount Prob Weight

1 1_Low 654 185.60 0.31105 3.214892 1_Low 017 205.48 0.34437 2.903853 1_Low 382 217.85 0.36510 2.738964 1_Low 614 230.56 0.38640 2.587975 1_Low 782 258.10 0.43256 2.311836 1_Low 775 330.54 0.55396 1.805187 2_Avg 784 505.14 0.34623 2.888238 2_Avg 332 540.65 0.37057 2.698539 2_Avg 002 567.89 0.38924 2.56909

10 2_Avg 239 620.10 0.42503 2.3527811 2_Avg 738 670.85 0.45981 2.1747912 2_Avg 496 753.30 0.51633 1.9367613 2_Avg 425 780.10 0.53470 1.8702214 2_Avg 478 806.90 0.55307 1.8081015 2_Avg 672 979.66 0.67148 1.4892516 2_Avg 139 1183.45 0.81116 1.2328017 3_High 568 1670.80 0.64385 1.5531618 3_High 263 1893.40 0.72963 1.3705619 3_High 285 2020.70 0.77869 1.2842120 3_High 486 2580.35 0.99435 1.00568

References

Bentley, J.L. and Floyd, R. (1987), “A Sample of Brilliance,”Communications of theAssociation for Computing Machinery, 30, 754–757.

Bentley, J.L. and Knuth, D. (1986), “Literate Programming,”Communications of theAssociation for Computing Machinery, 29, 364–369.

Brewer, K.W.R. (1963), “A Model of Systematic Sampling with Unequal Probabili-ties,” Australian Journal of Statistics, 5, 93–105.

Chromy, J.R. (1979), “Sequential Sample Selection Methods,”Proceedings of theAmerican Statistical Association, Survey Research Methods Section, 401–406.

Cochran, W.G. (1977),Sampling Techniques, Third Edition, New York: John Wiley& Sons, Inc.

Drummond, D., Lessler, J., Watts, D., and Williams, S. (1982), “A Design for Achiev-ing Prespecified Levels of Representation for Multiple Domains in Health RecordSamples,”Proceedings of the Fourth Conference on Health Survey ResearchMethods, DHHS Publication No. (PHS) 84-3346, Washington, D.C.: NationalCenter for Health Services Research, 233–248.

Durbin, J. (1967), “Design of Multi-stage Surveys for the Estimation of SamplingErrors,”Applied Statistics, 16, 152–164.

Fan, C.T., Muller, M.E., and Rezucha, I. (1962), “Development of Sampling Plans byUsing Sequential (Item by Item) Selection Techniques and Digital Computers,”Journal of the American Statistical Association, 57, 387–402.

SAS OnlineDoc: Version 8

Page 48: The SURVEYSELECT Procedure

3320 � Chapter 63. The SURVEYSELECT Procedure

Fox, D.R. (1989), “Computer Selection of Size-Biased Samples,”The AmericanStatistician, 43(3), 168–171.

Golmant, J. (1990), “Correction: Computer Selection of Size-Biased Samples,”TheAmerican Statistician, 44(2), 194.

Hanurav, T.V. (1967), “Optimum Utilization of Auxiliary Information:�ps Samplingof Two Units from a Stratum,”Journal of the Royal Statistical Society, Series B,29, 374–391.

Kalton, G. (1983),Introduction to Survey Sampling, Sage University Paper series onQuantitative Applications in the Social Sciences, series no. 07-035, Beverly Hillsand London: Sage Publications, Inc.

Kish, L. (1965),Survey Sampling, New York: John Wiley & Sons, Inc.

Kish, L. (1987),Statistical Design for Research, New York: John Wiley & Sons, Inc.

Madow, W.G. (1949), “On the Theory of Systematic Sampling, II,”Annals of Math-ematical Statistics, 20, 333–354.

McLeod, A.I. and Bellhouse, D.R. (1983), “A Convenient Algorithm for Drawing aSimple Random Sample,”Applied Statistics, 32, 182–183.

Murthy, M.N. (1957), “Ordered and Unordered Estimators in Sampling Without Re-placement,”Sankhya, 18, 379–390.

Murthy, M.N. (1967), Sampling Theory and Methods, Calcutta, India: StatisticalPublishing Society.

Sampford, M.R. (1967), “On Sampling without Replacement with Unequal Probabil-ities of Selection,”Biometrika, 54, 499–513.

Vijayan, K. (1968), “An Exact�ps Sampling Scheme: Generalization of a Method ofHanurav,”Journal of the Royal Statistical Society, Series B, 30, 556–566.

Watts, D.L. (1991), “Correction: Computer Selection of Size-Biased Samples,”TheAmerican Statistician, 45(2), 172.

Willburn, A.J. (1984),Practical Statistical Sampling for Auditors, New York: MarcelDekker, Inc.

Williams, R.L. and Chromy, J.R. (1980), “SAS Sample Selection Macros,”Proceed-ings of the Fifth Annual SAS Users Group International Conference, 5, 392–396.

SAS OnlineDoc: Version 8

Page 49: The SURVEYSELECT Procedure

The correct bibliographic citation for this manual is as follows: SAS Institute Inc.,SAS/STAT ® User’s Guide, Version 8, Cary, NC: SAS Institute Inc., 1999.

SAS/STAT® User’s Guide, Version 8Copyright © 1999 by SAS Institute Inc., Cary, NC, USA.ISBN 1–58025–494–2All rights reserved. Produced in the United States of America. No part of this publicationmay be reproduced, stored in a retrieval system, or transmitted, in any form or by anymeans, electronic, mechanical, photocopying, or otherwise, without the prior writtenpermission of the publisher, SAS Institute Inc.U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of thesoftware and related documentation by the U.S. government is subject to the Agreementwith SAS Institute and the restrictions set forth in FAR 52.227–19 Commercial ComputerSoftware-Restricted Rights (June 1987).SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.1st printing, October 1999SAS® and all other SAS Institute Inc. product or service names are registered trademarksor trademarks of SAS Institute Inc. in the USA and other countries.® indicates USAregistration.Other brand and product names are registered trademarks or trademarks of theirrespective companies.The Institute is a private company devoted to the support and further development of itssoftware and related services.