Introduction to Sampling

Situo Liu, Spry, Inc.

10/25/2013

Ways to deal with Big Data

• Big Analytics - use distributed database systems (Hadoop) and parallel programming (MapReduce)
• Sampling - use a representative sample to estimate the population
• Sampling in Hadoop
  • Hadoop isn't the king of interactive analysis
  • Sampling is a good way to grab a set of data and then play with it locally (R or Excel)
  • Pig has a handy SAMPLE keyword
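Pig's SAMPLE operator keeps each record with a given probability, so the size of the result is only approximately the requested fraction. A rough Python analogue of that behavior, for data that already fits on one machine, might look like the sketch below. The function name `bernoulli_sample` and the toy `rows` data are assumptions made for illustration, not part of Pig or Hadoop.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(100_000)  # hypothetical stand-in for records read from a large file
subset = bernoulli_sample(rows, 0.01, seed=42)
print(len(subset))  # close to 1,000, but usually not exactly 1,000
```

Because every row is decided independently, the sample can be drawn in a single pass without knowing the total number of rows in advance.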

Elements of a Sample

• Sample - a subset of individuals within a statistical population, used to estimate characteristics of the whole population
• Target Population - collection of observations we want to study
• Sampled Population - all possible observation units that might have been sampled
• Sampling Frame - list of all sampling units (student roster, list of phone numbers)
• Sampling Unit - unit we actually sample (e.g. household)
• Observational Unit - element to be measured (e.g. individual people in the household)

Sampling Techniques (1)

• Probability Sampling
  • Every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined.
  • Not every observational unit has to have the same probability of selection, but every observational unit's probability is known.
• Nonprobability Sampling
  • Some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'), or the probability of selection can't be accurately determined.
  • Because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors.

Sampling Techniques (2)

• Probability Sampling
  • Simple Random Sampling
  • Systematic Sampling
  • Stratified Sampling
  • Cluster or Multistage Sampling
  • Probability Proportional to Size Sampling
  • Panel sampling
• Nonprobability Sampling
  • Accidental sampling / Convenience sampling / Haphazard
  • Quota sampling
  • Purposive sampling / Judgmental sampling
  • Capture-Recapture sampling (determine population size)
  • Line-intercept sampling

http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelhoed.svg

Simple Random Sampling - SRS

• Definition: for a simple random sample of size n, every possible subset of n units in the population has the same chance of being the sample
• Requirement: a unique identifier for every unit in the population is needed for implementation
• Advantage: easy to understand and implement
• Disadvantage: often larger variance (lower precision) than designs that use auxiliary information, such as stratified sampling
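As a minimal sketch of the definition above, Python's standard `random.sample` draws n distinct items so that every size-n subset is equally likely. The frame of 200 identifiers and the function name `simple_random_sample` are made up for illustration.

```python
import random

def simple_random_sample(unit_ids, n, seed=None):
    """Draw n distinct units; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(unit_ids), n)

# Hypothetical sampling frame: one unique identifier per unit
frame = [f"unit-{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```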

Systematic Sampling

• Definition: Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. It starts from a random position and then selects every kth element from there onwards, where k = population size / sample size.
• Requirement: an ordering scheme for the population
• Advantage: easy to implement, very efficient
• Disadvantage: vulnerable to periodicities in the ordering
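The random start and fixed interval described above can be sketched as follows. This is a simplified version under one stated assumption: the interval k is taken as the integer part of N/n, so when N is not a multiple of n the last few units in the list can never be selected. The function name is hypothetical.

```python
import random

def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start in the first interval, then take every k-th unit."""
    N = len(ordered_units)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= population size")
    k = N // n                                 # sampling interval
    start = random.Random(seed).randrange(k)   # random start in [0, k)
    return [ordered_units[start + i * k] for i in range(n)]

population = list(range(1, 101))  # hypothetical ordered list of 100 units
sample = systematic_sample(population, 10, seed=0)
print(sample)  # ten values spaced exactly 10 apart
```

If the ordering has a cycle whose length matches k, every selected unit falls at the same point of the cycle, which is the periodicity risk noted in the slide.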

Stratified sampling (1)

• Definition: Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.
• Requirement: the population can be divided into distinct, independent strata, and the strata are chosen for their relevance to the criterion in question
  • Variability within strata is minimized
  • Variability between strata is maximized
  • The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
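A minimal sketch of stratified sampling with proportional allocation: the units are grouped by stratum, and a simple random sample of the same fraction is drawn inside each group. The `(id, region)` records, the `stratum_of` callback, and the rule of taking at least one unit per stratum are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw an SRS of about `fraction` of the units inside every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        size = max(1, round(fraction * len(members)))  # at least one per stratum
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population: (id, region) pairs with unequal stratum sizes
population = [(i, "north") for i in range(600)] + [(i, "south") for i in range(600, 1000)]
picked = stratified_sample(population, stratum_of=lambda u: u[1], fraction=0.1, seed=7)
print({name: len(rows) for name, rows in picked.items()})  # {'north': 60, 'south': 40}
```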

Stratified sampling (2)

• Advantage:
  • Inferences can be made about specific subgroups
  • Very likely more efficient statistical estimates
    • will not result in less efficiency than SRS, provided that each stratum's sample is proportional to the group's size in the population
  • Data may be more readily available for individual pre-existing strata within a population than for the overall population
  • Because strata are independent, different sampling approaches can be applied to different subgroups
• Disadvantage:
  • Complexity in implementation and estimation
  • Multiple stratification criteria can be tricky
  • A minimum sample size per group must be specified

Cluster Sampling (1)

• Definition: the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.
• Requirement: does not require a complete list of every unit in the population, only a sampling frame at the cluster level
  • Variability within clusters is maximized
  • Variability between clusters is minimized
  • The variables upon which the population is divided into clusters are not strongly correlated with the desired dependent variable.
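A minimal sketch of one-stage cluster sampling as defined above: whole clusters are chosen at random, and every unit in a chosen cluster enters the sample. The household and person labels and the `cluster_of` callback are hypothetical.

```python
import random
from collections import defaultdict

def cluster_sample(units, cluster_of, n_clusters, seed=None):
    """Select whole clusters at random and keep every unit inside them."""
    clusters = defaultdict(list)
    for unit in units:
        clusters[cluster_of(unit)].append(unit)
    # Only the list of cluster names is needed as a sampling frame
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 20 households with 5 people each
people = [(f"household-{h:02d}", f"person-{p}") for h in range(20) for p in range(5)]
sample = cluster_sample(people, cluster_of=lambda u: u[0], n_clusters=4, seed=3)
print(sorted(sample))  # names of the 4 selected households
```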

Cluster Sampling (2)

• Advantages:
  • Easy to implement
  • Cost-effective
• Disadvantages:
  • Complexity in estimation
  • May not reflect the diversity of clusters
  • Provides less information per observation than SRS
    • Units in the same cluster carry redundant information
  • Standard errors may be higher than in other sampling designs

Probability Proportional to Size sampling - PPS

• Definition: the selection probability for each element is set to be proportional to its size measure.
  • Every technique before was equal probability of selection (EPS)
• Requirement: an auxiliary variable / size measure that is correlated with the variable of interest
• Advantage:
  • May improve accuracy for a given sample size by concentrating the sample on large elements that have the greatest impact on estimation
  • Used in business and auditing as monetary unit sampling (MUS)
• Disadvantage:
  • Complexity in implementation and estimation
  • Different portions of the population may be over- or under-represented because selection probabilities vary
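One simple way to illustrate PPS is sampling with replacement, where each draw picks a unit with probability equal to its size divided by the total size. The sketch below uses Python's `random.choices` for that. It is only one PPS variant (without-replacement schemes are more involved), and the invoice labels and amounts are invented for illustration.

```python
import random

def pps_sample_with_replacement(units, sizes, n, seed=None):
    """Make n independent draws, each unit chosen with probability size / total."""
    if len(units) != len(sizes) or any(s <= 0 for s in sizes):
        raise ValueError("each unit needs one positive size measure")
    return random.Random(seed).choices(units, weights=sizes, k=n)

# Hypothetical invoices, with the invoice amount as the size measure
invoices = ["A", "B", "C", "D"]
amounts = [100, 200, 300, 9400]
draws = pps_sample_with_replacement(invoices, amounts, n=1000, seed=11)
print(draws.count("D") / len(draws))  # near 0.94, the share of D in the total amount
```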

Representativeness of the sample

• Match between target population and sampled population
• Method of drawing sample

Two kinds of Errors

• Non-sampling error - can be reduced by careful design of the survey
  • Selection bias - part of the target population is not in the sampled population (the target population may not have a natural frame, or the mode of data collection may restrict the frame)
  • Coverage error - the extent to which the sampling frame does not cover the target population
  • Measurement bias - the measuring instrument tends to differ from the true value in one direction
  • Measurement error (errors of observation)
    • Deviations of measurement
    • Inaccurate measurement
  • Item nonresponse (didn't understand, didn't see, or refused the question)
  • Unit nonresponse (not home, not approached by interviewer, refused the call)
• Sampling error - results from taking a sample instead of the whole population; can be quantified by statistics and reduced by increasing the sample size

Sample Size Calculation

• To know what our sample size needs to be, we must decide in advance the maximum estimation error we are willing to tolerate.
• Determine the nature of the estimate: proportion or mean
• Choose the confidence level of the estimate (equivalently, the significance level)

Proportion (1)

• Proportion: $\hat{p} = X/n$
  • where X is the number of 'positive' observations and n is the sample size
• When the observations are independent, X has a binomial distribution with variance $np(1-p)$, so the estimator $\hat{p}$ has variance $p(1-p)/n$
• The maximum variance of $\hat{p}$ is $0.25/n$, reached when p = 0.5
• For sufficiently large n, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Around 95% of this distribution's probability lies within 2 standard deviations of the mean.
• Taking the worst case p = 0.5, $\hat{p} \pm 2\sqrt{0.25/n}$ will form a 95% confidence interval for the true proportion.

Proportion (2)

• If this interval needs to be no more than W units wide, the equation $4\sqrt{0.25/n} = W$ can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the error bound on the estimate
  • i.e., the estimate is usually given as within ± B. So,
  • for B = 10% one requires n = 100,
  • for B = 5% one needs n = 400,
  • for B = 3% the requirement is n ≈ 1111 (often rounded to about 1000),
  • while for B = 1% a sample size of n = 10000 is required.
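The rule n = 1/B² can be checked with a few lines of Python. This is a sketch of the worst-case (p = 0.5), approximate 95% calculation only; the function name `sample_size_for_proportion` is made up for illustration, and results are rounded up to a whole number of observations.

```python
import math

def sample_size_for_proportion(error_bound):
    """Worst-case (p = 0.5) sample size for an approximate 95% interval of +/- error_bound."""
    if not 0 < error_bound < 1:
        raise ValueError("error_bound must be between 0 and 1")
    # round() guards against floating-point noise such as 99.99999999999999
    return math.ceil(round(1 / error_bound ** 2, 6))

for b in (0.10, 0.05, 0.03, 0.01):
    print(b, sample_size_for_proportion(b))
```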

Mean (1)

• A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance $\sigma^2$, the standard error of the sample mean is $\sigma/\sqrt{n}$
• This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form $\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$

Mean (2)

• If we wish to have a confidence interval that is W units in width, we would solve $4\sigma/\sqrt{n} = W$ for n, yielding the sample size $n = 16\sigma^2/W^2$
• e.g., if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is 6 units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is $16 \cdot 15^2 / 6^2 = 100$
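The blood-pressure example can be reproduced with a short sketch of n = 16σ²/W². It assumes the same approximate 95% interval (plus or minus two standard errors) and a known population standard deviation; the function name is hypothetical.

```python
import math

def sample_size_for_mean(sigma, width):
    """Sample size so an approximate 95% interval for the mean is `width` units wide."""
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # round() guards against floating-point noise before rounding up
    return math.ceil(round(16 * sigma ** 2 / width ** 2, 6))

print(sample_size_for_mean(sigma=15, width=6))  # 100
```

Halving the desired width quadruples the required sample size, since n grows with 1/W².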

Stratified Sample Size (1)

• The sample can often be split up into sub-samples. Typically, if there are k such sub-samples (from k different strata) then each of them will have a sample size $n_i$, i = 1, 2, ..., k. These $n_i$ must conform to the rule that $n_1 + n_2 + \dots + n_k = n$ (i.e. the total sample size is the sum of the sub-sample sizes). Selecting these $n_i$ optimally can be done in various ways, using (for example) Neyman's optimal allocation.
• There are many reasons to use stratified sampling: to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually. A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.
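Neyman allocation assigns each stratum a share of the total sample proportional to its population size times its standard deviation, $n_h = n \, N_h \sigma_h / \sum_i N_i \sigma_i$. The sketch below applies that rule and uses a largest-remainder step so the integer sizes still add up to n. The stratum sizes and standard deviations are invented, the rounding step is one possible choice rather than part of the rule itself, and the sketch does not cap an allocation at the stratum's population size.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Split total_n across strata in proportion to N_h * sigma_h."""
    weights = [N * s for N, s in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("need at least one stratum with positive size and spread")
    exact = [total_n * w / total for w in weights]
    alloc = [int(x) for x in exact]  # round down first
    # Hand the leftover units to the strata with the largest fractional parts
    leftovers = total_n - sum(alloc)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in order[:leftovers]:
        alloc[i] += 1
    return alloc

# Hypothetical strata: population sizes and standard deviations
print(neyman_allocation(100, [500, 300, 200], [10, 20, 40]))  # [26, 32, 42]
```

The smallest stratum receives the largest sub-sample here because its variability is highest.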

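Stratified Sample Size (2)

• In general, for H strata, a weighted sample mean is $\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$, where $\bar{x}_h$ is the sample mean of stratum h and $W_h$ is its weight (commonly the stratum's share of the population, $N_h/N$)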
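A weighted sample mean over strata can be sketched in a few lines, assuming each stratum is weighted by its share of the population, N_h / N. The stratum sizes and means below are invented for illustration.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Weighted sample mean: sum of (N_h / N) * mean_h over all strata."""
    N = sum(stratum_sizes)
    return sum(N_h / N * m for N_h, m in zip(stratum_sizes, stratum_means))

# Hypothetical strata: 600 and 400 units with sample means 10 and 20
print(stratified_mean([600, 400], [10.0, 20.0]))  # 14.0
```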

Thank you

http://www.spryinc.com/
