introductiontosampling-131029161007-phpapp02


1/22

Introduction to Sampling

Situo Liu

Spry, Inc.

10/25/2013


2/22

Ways to deal with Big Data

Big Analytics - use distributed database systems (Hadoop) and parallel programming (MapReduce)

Sampling - use a representative sample to estimate the population

Sampling in Hadoop

Hadoop isn't the king of interactive analysis

Sampling is a good way to grab a set of data and then play with it locally (R or Excel)

Pig has a handy SAMPLE keyword
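
For local experiments, the same idea can be sketched in Python. This assumes the fraction-based behaviour of Pig's SAMPLE (each row is kept independently with the given probability, so the sample size is only approximately fraction * N). The function name bernoulli_sample is hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    The number of rows returned is random; only its expected value
    equals fraction * len(rows).
    """
    rng = random.Random(seed)  # local generator so runs can be reproduced
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    rows = list(range(1000))
    sample = bernoulli_sample(rows, 0.05, seed=42)
    print(len(sample), "rows kept out of", len(rows))
```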


3/22

Elements of a Sample

Sample - a subset of individuals drawn from a statistical population, used to estimate characteristics of the whole population

Target Population - collection of observations we want to study

Sampled Population - all possible observation units that might have been sampled

Sampling Frame - list of all sampling units (e.g. student roster, list of phone numbers)

Sampling Unit - unit we actually sample (e.g. household)

Observational Unit - element to be measured (e.g. individual people in the household)


4/22

Sampling Techniques (1)

Probability Sampling

Every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined.

Not every observational unit has to have the same probability of selection, but every observational unit's probability is known.

Nonprobability Sampling

Some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'), or the probability of selection can't be accurately determined.

Because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors.


5/22

Sampling Techniques (2)

Probability Sampling

Simple Random Sampling

Systematic Sampling

Stratified Sampling

Cluster or Multistage Sampling

Probability Proportional to Size Sampling

Panel sampling

Nonprobability Sampling

Accidental sampling / Convenience sampling / Haphazard

Quota sampling

Purposive sampling / Judgmental sampling

Capture-Recapture sampling (determine population size)

Line-intercept sampling (http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelhoed.svg)


6/22

Simple Random Sampling - SRS

Definition: for a simple random sample of size n, every possible subset of n units in the population has the same chance of being the sample

Requirement: a unique identifier for each unit is needed for implementation

Advantage: easy to understand and implement

Disadvantage: uses no auxiliary information, so estimates often have higher variance (lower precision) than a well-designed stratified sample of the same size
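
A minimal sketch of drawing an SRS without replacement, assuming the sampling frame is a Python list of unique identifiers. The function name and the seed parameter are illustrative, not part of the original slides.

```python
import random


def simple_random_sample(ids, n, seed=None):
    """Draw n distinct identifiers so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement from a sequence
    return rng.sample(ids, n)


if __name__ == "__main__":
    frame = [f"unit-{i}" for i in range(1000)]  # hypothetical sampling frame
    print(simple_random_sample(frame, 5, seed=7))
```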


7/22

Systematic Sampling

Definition: Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. It involves a random start and then proceeds with the selection of every kth element from then onwards, where k = population size / sample size.

Requirement: Ordering scheme for population

Advantage: easy to implement, very efficient

Disadvantage: vulnerable to periodicities
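
The random start and fixed interval can be sketched as follows. This assumes the population is already an ordered list and uses integer division for k, which is one common simplification when the population size is not a multiple of the sample size. The function name is hypothetical.

```python
import random


def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start in the first interval, then take every k-th unit."""
    k = len(ordered_units) // n  # sampling interval (integer part)
    if k < 1:
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start in 0 .. k-1
    # Slice with step k; trim in case the slice yields one extra unit
    return ordered_units[start::k][:n]


if __name__ == "__main__":
    population = list(range(100))
    print(systematic_sample(population, 10, seed=3))
```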


8/22

Stratified sampling (1)

Definition: Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.

Requirement: the population can be divided into distinct, independent strata, and the strata are chosen based on their relevance to the criterion in question

Variability within strata is minimized

Variability between strata is maximized

The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
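
A minimal sketch of stratified sampling with proportional allocation, assuming each unit can be mapped to its stratum by a caller-supplied function. The names stratified_sample and stratum_of are illustrative, and forcing at least one unit per stratum is a choice made here, not something the slides prescribe.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum using SRS within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = {}
    for name, members in strata.items():
        # Proportional allocation; keep at least one unit per stratum
        n_h = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n_h)
    return sample


if __name__ == "__main__":
    units = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
    picked = stratified_sample(units, lambda u: u[0], 0.1, seed=1)
    print({name: len(rows) for name, rows in picked.items()})
```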


9/22

Stratified sampling (2)

Advantage:

Inferences can be made about specific subgroups

Very likely more efficient statistical estimates

Will never result in less efficiency than SRS, provided that each stratum's sample size is proportional to the stratum's size in the population

Data may be more readily available for individual pre-existing strata within a population than for the overall population

Because strata are independent, different sampling approaches can be applied to different subgroups

Disadvantage:

Complexity in implementation and estimation

Multiple stratification criteria can be tricky

A specified minimum sample size per group can require a larger overall sample


10/22

Cluster Sampling (1)

Definition: the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.

Requirement: does not require a complete list of every unit in the population, only a sampling frame at the cluster level

Variability within clusters is maximized

Variability between clusters is minimized

The variables upon which the population is divided into clusters are not strongly correlated with the desired dependent variable.
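
One-stage cluster sampling as defined above can be sketched like this, assuming the frame is a dict from cluster identifier to the list of units in that cluster. The structure and the function name are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    # Only the list of cluster ids is needed as a sampling frame
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {cluster_id: list(clusters[cluster_id]) for cluster_id in chosen}


if __name__ == "__main__":
    schools = {f"school-{i}": [f"student-{i}-{j}" for j in range(5)] for i in range(10)}
    picked = cluster_sample(schools, 3, seed=0)
    print(sorted(picked))
```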


11/22

Cluster Sampling (2)

Advantages:

Easy to implement

Cost-effective

Disadvantages:

Complexity in estimation

May not reflect the diversity of clusters

Provides less information per observation than SRS

Units in the same cluster carry redundant information

Standard errors may be higher than with other sampling designs


12/22

Probability Proportional to Size Sampling - PPS

Definition: the selection probability for each element is set to be proportional to its size measure.

The techniques before this one are typically used with equal probability of selection (EPS)

Requirement: an auxiliary variable / size measure, correlated with the variable of interest

Advantage:

May improve accuracy for a given sample size by concentrating the sample on large elements that have the greatest impact on estimation

For business and auditing, this is known as monetary unit sampling (MUS)

Disadvantage:

Complexity in implementation and estimation

Different portions of the population may be over- or under-represented due to the variation in selection probability
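
A minimal sketch of PPS selection, assuming draws are made with replacement (the simplest PPS variant; without-replacement PPS designs need more care). The function name and the example account balances are made up for illustration.

```python
import random


def pps_sample(units, sizes, n, seed=None):
    """Draw n units with replacement, each draw proportional to the unit's size measure."""
    rng = random.Random(seed)
    # random.choices accepts relative weights and samples with replacement
    return rng.choices(units, weights=sizes, k=n)


if __name__ == "__main__":
    accounts = ["A", "B", "C", "D"]          # hypothetical audit accounts
    balances = [100.0, 2500.0, 40.0, 900.0]  # size measure: monetary value
    print(pps_sample(accounts, balances, 5, seed=11))
```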


13/22

Representativeness of the sample

Match between target population and

sampled population

Method of drawing sample


14/22

Two kinds of Errors

Non-sampling error - can be reduced by careful design of the survey

Selection bias - part of the target population is not in the sampled population (the target population may not have a natural frame, or the mode of data collection may restrict the frame)

Coverage error - the extent to which the sampling frame does not cover the target population

Measurement bias - the measuring instrument has a tendency to differ from the true value in one direction

Measurement error (errors of observation)

Deviations of measurement

Inaccurate measurement

Item nonresponse (didn't understand, didn't see, or refused the question)

Unit nonresponse (not home, not approached by interviewer, refused the call)

Sampling error - results from taking a sample instead of the whole population; can be quantified statistically and reduced by increasing the sample size


15/22

Sample Size Calculation

In order to know what our sample size needs to be, we must

decide in advance the maximum estimation error we are

willing to tolerate.

Determine the nature of the estimate: proportion or mean

Choose the confidence level of the estimate (equivalently, the significance level)


16/22

Proportion (1)

Proportion: $\hat{p} = X/n$

where X is the number of 'positive' observations and n is the sample size

When the observations are independent, X has a binomial distribution with variance $np(1-p)$, so $\hat{p}$ has variance $p(1-p)/n$

The maximum variance of $\hat{p}$ is $0.25/n$, when $p = 0.5$

For sufficiently large n, the distribution of $\hat{p}$ will be closely approximated by a normal distribution. Around 95% of this distribution's probability lies within 2 standard deviations of the mean.

Using the maximum variance, an interval of the form

$\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$

will form a 95% confidence interval for the true proportion.


17/22

Proportion (2)

If this interval needs to be no more than W units wide, the equation

$4\sqrt{0.25/n} = W$

can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the error bound on the estimate

i.e., the estimate is usually given as within ± B. So,

for B = 10% one requires n = 100,

for B = 5% one needs n = 400,

for B = 3% the requirement approximates to n = 1000,

while for B = 1% a sample size of n = 10000 is required.
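
The rule n = 1/B^2 can be checked with a short sketch. It assumes the worst case p = 0.5 and the 2-standard-deviation approximation used on these slides. The function name is illustrative.

```python
import math


def sample_size_for_proportion(error_bound):
    """Smallest n with 2 * sqrt(0.25 / n) <= error_bound, i.e. n >= 1 / B^2."""
    raw = 1.0 / error_bound ** 2
    # Round first so floating-point noise (e.g. 99.99999999999999) does not shift the ceiling
    return math.ceil(round(raw, 9))


if __name__ == "__main__":
    for bound in (0.10, 0.05, 0.03, 0.01):
        print(f"B = {bound:.0%}: n = {sample_size_for_proportion(bound)}")
```

For B = 3% the exact value is 1111.1..., which the slide rounds to roughly 1000.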


18/22

Mean (1)

A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance $\sigma^2$, the standard error of the sample mean is:

$\sigma/\sqrt{n}$

This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form

$\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$


19/22

Mean (2)

If we wish to have a confidence interval that is W units in width, we would solve

$4\sigma/\sqrt{n} = W$

for n, yielding the sample size $n = 16\sigma^2/W^2$.

For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is 6 units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is $16 \cdot 15^2 / 6^2 = 100$.
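
The blood pressure example can be reproduced with a small sketch, assuming the population standard deviation is known and the same 2-standard-error approximation is used. The function name is hypothetical.

```python
import math


def sample_size_for_mean(sd, width):
    """Smallest n such that a +/- 2 standard error interval is at most `width` wide."""
    raw = 16.0 * sd ** 2 / width ** 2
    # Round first so floating-point noise does not push the ceiling up by one
    return math.ceil(round(raw, 9))


if __name__ == "__main__":
    print(sample_size_for_mean(sd=15, width=6))
```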


20/22

Stratified Sample Size (1)

The sample can often be split up into sub-samples. Typically, if there are k such sub-samples (from k different strata) then each of them will have a sample size $n_i$, i = 1, 2, ..., k. These $n_i$ must conform to the rule that $n_1 + n_2 + ... + n_k = n$ (i.e. that the total sample size is given by the sum of the sub-sample sizes). Selecting these $n_i$ optimally can be done in various ways, using (for example) Neyman's optimal allocation.

There are many reasons to use stratified sampling: [7] to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually. A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.
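
A minimal sketch of Neyman allocation, which sets each stratum's sample size proportional to N_h * sigma_h (stratum size times within-stratum standard deviation). It assumes those standard deviations are known or estimated beforehand, and it returns real-valued allocations, leaving rounding to the caller. The function name and example figures are made up.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata in proportion to N_h * sigma_h."""
    products = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one stratum needs positive size and spread")
    return [n * product / total for product in products]


if __name__ == "__main__":
    sizes = [500, 300, 200]   # hypothetical stratum population sizes
    sds = [10.0, 20.0, 30.0]  # hypothetical within-stratum standard deviations
    print([round(a, 1) for a in neyman_allocation(100, sizes, sds)])
```

When all strata have the same standard deviation, this reduces to proportional allocation.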


21/22

Stratified Sample Size (2)

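In general, for H strata, a weighted sample mean is

$\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$

where $\bar{x}_h$ is the sample mean in stratum h and $W_h$ is the weight of stratum h.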
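
A weighted sample mean over strata can be sketched as below. The weights are assumed to be the population shares N_h / N, which is a common choice but not the only one. The function name is illustrative.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Combine per-stratum sample means using weights W_h = N_h / N."""
    total = sum(stratum_sizes)
    return sum(size / total * mean for size, mean in zip(stratum_sizes, stratum_means))


if __name__ == "__main__":
    # Two hypothetical strata: 80% of the population with mean 10, 20% with mean 20
    print(stratified_mean([80, 20], [10.0, 20.0]))
```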


22/22

Thank you

[email protected]

http://www.spryinc.com/
