introductiontosampling-131029161007-phpapp02

Upload: addisu-tefera

Post on 03-Jun-2018


  • 8/12/2019 introductiontosampling-131029161007-phpapp02

    1/22

    Introduction to Sampling

    Situo Liu

    Spry, Inc.

    10/25/2013

    Ways to deal with Big Data

    Big Analytics - use distributed database systems (Hadoop) and parallel programming (MapReduce)

    Sampling - use a representative sample to estimate the population

    Sampling in Hadoop

    Hadoop isn't the king of interactive analysis

    Sampling is a good way to grab a set of data and then play with it locally (R or Excel)

    Pig has a handy SAMPLE keyword
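
    The idea of sampling rows by chance can be sketched outside Pig as well. The Python below is not Pig code; it assumes SAMPLE behaves like a per-row probabilistic filter (each row kept with a given probability), and the function name and data are hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [row for row in rows if rng.random() < fraction]


# Hypothetical data: 100,000 row ids standing in for records in a large file.
rows = range(100_000)
subset = bernoulli_sample(rows, 0.01, seed=42)
print(len(subset))  # roughly 1,000; the exact count depends on the seed
```

    Because every row is decided independently, the sample size is only approximately fraction x population size.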

    Elements of a Sample

    Sample - a subset of individuals drawn from a statistical population and used to estimate characteristics of the whole population

    Target Population - the collection of observations we want to study

    Sampled Population - all possible observation units that might have been sampled

    Sampling Frame - list of all sampling units (e.g. student roster, list of phone numbers)

    Sampling Unit - the unit we actually sample (e.g. household)

    Observational Unit - the element to be measured (e.g. individual people in the household)

    Sampling Techniques (1)

    Probability Sampling

    Every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined.

    Not every observational unit has to have the same probability of selection, but every observational unit's probability is known.

    Nonprobability Sampling

    Some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'), or the probability of selection can't be accurately determined.

    Because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors.

    Sampling Techniques (2)

    Probability Sampling

    Simple Random Sampling

    Systematic Sampling

    Stratified Sampling

    Cluster or Multistage Sampling

    Probability Proportional to Size Sampling

    Panel sampling

    Nonprobability Sampling

    Accidental sampling / Convenience sampling / Haphazard sampling

    Quota sampling

    Purposive sampling / Judgmental sampling

    Capture-Recapture sampling (determine population size)

    Line-intercept sampling (http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelhoed.svg)

    Simple Random Sampling - SRS

    Definition: for a simple random sample of size n, every possible subset of n units in the population has the same chance of being the sample

    Requirement: a unique identifier for each unit is needed for implementation

    Advantage: easy to understand and implement

    Disadvantage: can have larger variance (lower accuracy) than designs that use auxiliary information, such as stratified sampling
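
    A minimal sketch of SRS without replacement, assuming the sampling frame is simply a list of unique unit identifiers (the ids and function name here are made up for illustration):

```python
import random


def simple_random_sample(unit_ids, n, seed=None):
    """Draw n distinct units so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # seed is optional; used here for repeatability
    return rng.sample(list(unit_ids), n)


# Hypothetical frame: 1,000 units identified by the integers 1..1000.
frame = range(1, 1001)
chosen = simple_random_sample(frame, 25, seed=7)
print(sorted(chosen))
```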

    Systematic Sampling

    Definition: Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. It involves a random start and then proceeds with the selection of every kth element from then onwards, where k = population size / sample size.

    Requirement: Ordering scheme for population

    Advantage: easy to implement, very efficient

    Disadvantage: vulnerable to periodicities
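
    The random start plus every-kth-element rule can be sketched as follows. This is a simplified version under stated assumptions: k is rounded down to an integer, so when the population size is not a multiple of the sample size the last few units are never reached. The function name is hypothetical.

```python
import random


def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start in the first interval, then every k-th unit."""
    units = list(ordered_units)
    if not 0 < n <= len(units):
        raise ValueError("n must be between 1 and the population size")
    k = len(units) // n           # sampling interval (rounded down)
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start among the first k units
    return units[start::k][:n]    # every k-th unit, trimmed to n


# Hypothetical ordered list of 100 units, sample size 10, so k = 10.
print(systematic_sample(range(100), 10, seed=3))
```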

    Stratified sampling (1)

    Definition: Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.

    Requirement: the population can be divided into distinct, independent strata, and the strata are chosen for their relevance to the criterion in question

    Variability within strata is minimized

    Variability between strata is maximized

    The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
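
    A minimal sketch of stratified sampling with proportional allocation, assuming each unit carries a label that assigns it to exactly one stratum. The `stratum_of` callback, the sampling fraction, and the example data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw an independent simple random sample from every stratum."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # organize the frame by stratum
    sample = {}
    for name, members in strata.items():
        size = max(1, round(fraction * len(members)))  # at least one per stratum
        sample[name] = rng.sample(members, size)
    return sample


# Hypothetical population: 60 urban and 40 rural units.
population = [(i, "urban" if i < 60 else "rural") for i in range(100)]
by_stratum = stratified_sample(population, lambda u: u[1], 0.1, seed=0)
print({name: len(members) for name, members in by_stratum.items()})
```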

    Stratified sampling (2)

    Advantage:

    Inferences can be made about specific subgroups

    Very likely more efficient statistical estimates

    Will never result in less efficiency than SRS, provided that the sample drawn from each stratum is proportional to the stratum's size in the population

    Data may be more readily available for individual pre-existing strata within a population than for the overall population

    Because strata are independent, different sampling approaches can be applied to different subgroups

    Disadvantage:

    Complexity in implementation and estimation

    Multiple criteria can be tricky

    Specified minimum sample size per group

    Cluster Sampling (1)

    Definition: the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.

    Requirement: does not require a complete list of every unit in the population, only a sampling frame at the cluster level

    Variability within clusters is maximized

    Variability between clusters is minimized

    The variables upon which the population is divided into clusters are not strongly correlated with the desired dependent variable.
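
    One-stage cluster sampling as defined above can be sketched like this, assuming the cluster-level frame is a mapping from cluster id to its units. The cluster names and the function name are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), n_clusters)  # sample cluster ids only
    return {cid: list(clusters[cid]) for cid in chosen_ids}


# Hypothetical frame: three schools, each listing its pupils.
clusters = {
    "school_a": [1, 2, 3],
    "school_b": [4, 5],
    "school_c": [6, 7, 8, 9],
}
picked = cluster_sample(clusters, 2, seed=0)
print(picked)
```

    Only the list of cluster ids is needed to draw the sample; the units are enumerated after a cluster has been selected.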

    Cluster Sampling (2)

    Advantages:

    Easy to implement

    Cost-effective

    Disadvantages:

    Complexity in estimation

    May not reflect the diversity of clusters

    Provides less information per observation than SRS

    Units in the same cluster tend to carry redundant information

    Standard errors may be higher than other sampling designs

    Probability Proportional to Size Sampling - PPS

    Definition: the selection probability for each element is set to be proportional to its size measure.

    Every technique before used equal probability of selection (EPS)

    Requirement: an auxiliary variable / size measure, correlated with the variable of interest

    Advantage:

    May improve accuracy for a given sample size by concentrating the sample on large elements that have the greatest impact on estimation

    Used in business and auditing as monetary unit sampling (MUS)

    Disadvantage:

    Complexity in implementation and estimation

    Different portions of the population may be over- or under-represented due to the variation in selection probability
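
    A minimal sketch of PPS, assuming draws are made with replacement so that on every draw a unit's selection probability is its size divided by the total size. Other PPS schemes (for example without replacement) exist and are not shown. The account names and amounts are hypothetical.

```python
import random


def selection_probabilities(sizes):
    """Per-draw selection probability of each unit: size / total size."""
    total = sum(sizes)
    if total <= 0 or any(s < 0 for s in sizes):
        raise ValueError("sizes must be non-negative with a positive total")
    return [s / total for s in sizes]


def pps_sample_with_replacement(units, sizes, n, seed=None):
    """Draw n units, each draw proportional to the unit's size measure."""
    if len(units) != len(sizes):
        raise ValueError("units and sizes must have the same length")
    selection_probabilities(sizes)  # reuse the validation above
    rng = random.Random(seed)
    return rng.choices(units, weights=sizes, k=n)


# Hypothetical audit population: account ids with their monetary amounts.
accounts = ["acct_1", "acct_2", "acct_3", "acct_4"]
amounts = [500, 1500, 0, 8000]
print(selection_probabilities(amounts))
print(pps_sample_with_replacement(accounts, amounts, 5, seed=11))
```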

    Representativeness of the sample

    Match between target population and

    sampled population

    Method of drawing sample

    Two kinds of Errors

    Non-sampling error - can be reduced by careful design of the survey

    Selection bias - part of the target population is not in the sampled population (the target population may not have a natural frame, or the mode of data collection may restrict the frame)

    Coverage error - the extent to which the sampling frame does not cover the target population

    Measurement bias - the measuring instrument has a tendency to differ from the true value in one direction

    Measurement error (errors of observation)

    Deviations of measurement

    Inaccurate measurement

    Item nonresponse (didn't understand, didn't see, or refused the question)

    Unit nonresponse (not home, not approached by interviewer, refused the call)

    Sampling error - results from taking a sample instead of the whole population; can be quantified by statistics and reduced by increasing the sample size

    Sample Size Calculation

    In order to know what our sample size needs to be, we must decide in advance the maximum estimation error we are willing to tolerate.

    Determine the nature of the estimate - proportion or mean

    Choose the confidence level of the estimate - significance level

    Proportion (1)

    Proportion: $\hat{p} = X/n$

    where X is the number of 'positive' observations and n is the sample size

    When the observations are independent, X has a binomial distribution with variance $np(1-p)$, so $\hat{p}$ has variance $p(1-p)/n$

    The variance of X is at most $0.25n$ (and that of $\hat{p}$ at most $0.25/n$), attained when $p = 0.5$

    For sufficiently large n, the distribution of $\hat{p}$ will be closely approximated by a normal distribution. Around 95% of this distribution's probability lies within 2 standard deviations of the mean.

    Thus, an interval of the form $\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$ will form a 95% confidence interval for the true proportion.

    Proportion (2)

    If this interval needs to be no more than W units wide, the equation

    $4\sqrt{0.25/n} = W$

    can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the error bound on the estimate

    i.e., the estimate is usually given as within $\pm B$. So,

    for B = 10% one requires n = 100,

    for B = 5% one needs n = 400,

    for B = 3% the requirement approximates to n = 1000,

    while for B = 1% a sample size of n = 10000 is required.
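
    The rule n = 1/B^2 can be checked with a few lines of Python. This sketch assumes the worst case p = 0.5 and the 2-standard-deviation approximation used above; the function name is hypothetical, and exact fractions are used only to avoid floating-point rounding when taking the ceiling.

```python
import math
from fractions import Fraction


def sample_size_for_proportion(error_bound):
    """Smallest n with 1/B^2 <= n, for an error bound B given as a fraction."""
    b = Fraction(str(error_bound))   # e.g. 0.05 -> 1/20, exact arithmetic
    if b <= 0:
        raise ValueError("error bound must be positive")
    return math.ceil(1 / b**2)


for bound in (0.10, 0.05, 0.03, 0.01):
    print(bound, sample_size_for_proportion(bound))
```

    For B = 3% the exact value is 10000/9, about 1111, which is the figure the list above rounds to roughly 1000.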

    Mean (1)

    A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance $\sigma^2$, the standard error of the sample mean is:

    $\sigma / \sqrt{n}$

    This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form

    $\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$

    Mean (2)

    If we wish to have a confidence interval that is W units in width, we would solve

    $4\sigma / \sqrt{n} = W$

    for n, yielding the sample size $n = 16\sigma^2 / W^2$.

    For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is 6 units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is $16 \cdot 15^2 / 6^2 = 100$.
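
    The blood pressure example can be reproduced with a short calculation. This sketch assumes the population standard deviation is known and uses the same 2-standard-error approximation for a 95% interval; the function name is hypothetical.

```python
import math
from fractions import Fraction


def sample_size_for_mean(sigma, width):
    """Smallest n with 16 * sigma^2 / W^2 <= n for a 95% interval of width W."""
    s = Fraction(str(sigma))
    w = Fraction(str(width))
    if s < 0 or w <= 0:
        raise ValueError("sigma must be non-negative and width positive")
    return math.ceil(16 * s**2 / w**2)


# Standard deviation 15, interval 6 units wide.
print(sample_size_for_mean(15, 6))
```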

    Stratified Sample Size (1)

    The sample can often be split up into sub-samples. Typically, if there are k such sub-samples (from k different strata) then each of them will have a sample size $n_i$, i = 1, 2, ..., k. These $n_i$ must conform to the rule that $n_1 + n_2 + ... + n_k = n$ (i.e. that the total sample size is given by the sum of the sub-sample sizes). Selecting these $n_i$ optimally can be done in various ways, using (for example) Neyman's optimal allocation.

    There are many reasons to use stratified sampling [7]: to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually. A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.
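
    As an illustration of Neyman's optimal allocation, the sketch below splits a total sample size across strata in proportion to (stratum size x stratum standard deviation). It assumes the stratum sizes and standard deviations are known or estimated in advance, and it returns unrounded values; turning them into integers that still sum to n is left out. All names and numbers are hypothetical.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Allocate total_n across strata proportionally to N_h * sigma_h."""
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("need one size and one standard deviation per stratum")
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one stratum must have positive size and spread")
    return [total_n * w / total_weight for w in weights]


# Hypothetical strata: sizes 500, 300, 200 with standard deviations 4, 10, 15.
print(neyman_allocation(100, [500, 300, 200], [4, 10, 15]))
```

    Strata that are larger or more variable receive a bigger share of the sample.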

    Stratified Sample Size (2)

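    In general, for H strata, a weighted sample mean is

    $\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$

    where $\bar{x}_h$ is the sample mean of stratum h and $W_h$ is its weight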
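
    A weighted sample mean over strata can be sketched as below. The choice of weights is an assumption made here: each stratum is weighted by its share of the population, N_h / N, which is a common choice but not the only one. The function name and numbers are hypothetical.

```python
def stratified_mean(stratum_sizes, stratum_sample_means):
    """Combine per-stratum sample means using population-share weights N_h / N."""
    if len(stratum_sizes) != len(stratum_sample_means):
        raise ValueError("need one size and one sample mean per stratum")
    total = sum(stratum_sizes)
    if total <= 0:
        raise ValueError("total population size must be positive")
    weighted_sum = sum(size * mean
                       for size, mean in zip(stratum_sizes, stratum_sample_means))
    return weighted_sum / total


# Hypothetical strata: 60 and 40 units with sample means 10 and 20.
print(stratified_mean([60, 40], [10.0, 20.0]))
```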

    Thank you!

    [email protected]

    http://www.spryinc.com/