
  • Unit 5: Inference for categorical variables

    Lecture 1: Inference for proportions - theoretical

    Statistics 101

    Mine Çetinkaya-Rundel

    March 19, 2013

    Announcements

    Feedback on project proposals

    Data collection: Don’t just copy text from the data source, rephrase and condense information.

    Scope of inference: Large n doesn’t ensure generalizability, we need a representative sample from the population, which is often provided by a random sample. Also, define “population at large”.

    EDA: Include univariate as well as bivariate EDA. Make sure to compare across groups, not just list characteristics. Even though inference might focus on the mean, also compare shape, spread, and discuss any unusual observations. IQR: simply stating the value is often not very informative, something like “the middle 50% of observations between Q1 and Q3” is more informative.

    Categorical variables: If you have more than 2 levels, collapse into 2 levels for the inference portion (no restrictions for EDA). This means you won’t be conducting ANOVA or chi-squared tests.

    Formatting: Smaller plots, but not too small. I will be posting a revised template for the project that uses smaller fonts to help with spacing.

    Population data: Fix, or discuss. There may be penalties for using population data.

    Data format: Organize such that each row is a case and each column is a variable (avoid subsets).

    “Come by OH”: Simply means easier to discuss in person, and you’ll benefit from it...

    Statistics 101 (Mine Çetinkaya-Rundel) U5 - L1: Inf. for prop.s - theoretical March 19, 2013 2 / 29

    Announcements

    Visualization of the day

    http://gizmodo.com/5991141/the-most-accurate-map-of-college-basketball-fandom

    Single population proportion

    Clicker question

    Two scientists want to know if a certain drug is effective against high blood pressure. The first scientist wants to give the drug to 1000 people with high blood pressure and see how many of them experience lower blood pressure levels. The second scientist wants to give the drug to 500 people with high blood pressure, and not give the drug to another 500 people with high blood pressure, and see how many in both groups experience lower blood pressure levels. Which is the better way to test this drug?

    (a) All 1000 get the drug

    (b) 500 get the drug, 500 don’t


  • Single population proportion

    Results from the GSS

    The GSS asks the same question. Below is the distribution of responses from the 2010 survey:

    Response                          Count
    All 1000 get the drug                99
    500 get the drug, 500 don’t         571
    Total                               670

    Single population proportion

    Parameter and point estimate

    We would like to estimate the proportion of all Americans who have a good intuition about experimental design, i.e. would answer “500 get the drug, 500 don’t”. What are the parameter of interest and the point estimate?

    Parameter of interest: Proportion of all Americans who have a good intuition about experimental design.

    p (a population proportion)

    Point estimate: Proportion of sampled Americans who have a good intuition about experimental design.

    p̂ (a sample proportion) = 571/670 ≈ 0.85

    Single population proportion

    Inference on a proportion

    What percent of all Americans have a good intuition about experimental design, i.e. would answer “500 get the drug, 500 don’t”?

    We can answer this research question using a confidence interval, which we know is always of the form

    point estimate ± ME

    And we also know that ME = critical value × standard error of the point estimate.

    $SE_{\hat{p}} = ?$

    Standard error of a sample proportion

    $SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

    Single population proportion Identifying when a sample proportion is nearly normal

    Sample proportions are also nearly normally distributed

    Central limit theorem for proportions

    Sample proportions will be nearly normally distributed with mean equal to the population proportion, p, and standard error equal to $\sqrt{\frac{p(1-p)}{n}}$.

    $\hat{p} \sim N\left(\text{mean} = p,\ SE = \sqrt{\frac{p(1-p)}{n}}\right)$

    But of course this is true only under certain conditions... any guesses?

    Note: If p is unknown (most cases), we use p̂ in the calculation of the standard error.
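The central limit theorem claim above can be checked informally by simulation. The sketch below is only an illustration, not part of the original slides: it assumes a true proportion of p = 0.85 and n = 670 (borrowed from the GSS example in this lecture), and the function name, seed, and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics


def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` random samples of size `n` and return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


# Assumed values for illustration: p = 0.85, n = 670.
p, n = 0.85, 670
p_hats = simulate_sample_proportions(p, n, reps=1000)

sim_mean = statistics.mean(p_hats)      # should land close to p
sim_sd = statistics.stdev(p_hats)       # should land close to the theoretical SE
theory_se = math.sqrt(p * (1 - p) / n)  # sqrt(p(1-p)/n)

print(f"simulated mean = {sim_mean:.4f}, simulated SD = {sim_sd:.4f}, theoretical SE = {theory_se:.4f}")
```

If the simulated mean sits near p and the simulated standard deviation sits near the theoretical standard error, that is consistent with the nearly normal model for p̂.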

  • Single population proportion Confidence intervals for a proportion

    Back to experimental design...

    The GSS found that 571 out of 670 (85%) of Americans answered the question on experimental design correctly. Estimate (using a 95% confidence interval) the proportion of all Americans who have a good intuition about experimental design.

    Given: n = 670, p̂ = 0.85. First check conditions.

    1. Independence: The sample is random, and 670 < 10% of all Americans, therefore we can assume that one respondent’s response is independent of another.

    2. Success-failure: 571 people answered correctly (successes) and 99 answered incorrectly (failures), both are greater than 10.
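The two numeric checks above (the 10% condition and the success-failure condition) can be written as a small helper. This is a hedged sketch: the function name is made up, and the population size of 300 million is only a rough placeholder for "all Americans", not a figure from the slides. Whether the sample is random cannot be checked in code and still has to be argued from the study design.

```python
def check_ci_conditions(successes, failures, population_size, threshold=10):
    """Return (ten_percent_ok, success_failure_ok) for a one-proportion confidence interval."""
    n = successes + failures
    ten_percent_ok = n < 0.10 * population_size
    success_failure_ok = successes >= threshold and failures >= threshold
    return ten_percent_ok, success_failure_ok


# 571 successes and 99 failures come from the GSS table;
# 300_000_000 is an ASSUMED rough population size, used only for illustration.
ten_percent_ok, success_failure_ok = check_ci_conditions(571, 99, 300_000_000)
print(ten_percent_ok, success_failure_ok)
```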

    Single population proportion Confidence intervals for a proportion

    Clicker question

    We are given that n = 670 and p̂ = 0.85. We also just learned that the standard error of the sample proportion is $SE = \sqrt{\frac{p(1-p)}{n}}$. Which of the below is the correct calculation of the 95% confidence interval?

    (a) $0.85 \pm 1.96 \times \sqrt{\frac{0.85 \times 0.15}{670}}$

    (b) $0.85 \pm 1.65 \times \sqrt{\frac{0.85 \times 0.15}{670}}$

    (c) $0.85 \pm 1.96 \times \frac{0.85 \times 0.15}{\sqrt{670}}$

    (d) $571 \pm 1.96 \times \sqrt{\frac{571 \times 99}{670}}$
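For reference, the general recipe "point estimate ± critical value × standard error" can be sketched in a few lines. The helper name below is hypothetical, and 1.96 is the usual critical value for 95% confidence under the normal model; the inputs are the rounded values given on the slide.

```python
import math


def proportion_ci(p_hat, n, z_star=1.96):
    """Confidence interval for a proportion: p_hat ± z_star * sqrt(p_hat(1 - p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_star * se
    return p_hat - margin_of_error, p_hat + margin_of_error


lower, upper = proportion_ci(0.85, 670)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With these inputs the interval comes out to roughly 82% to 88%.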

    Single population proportion Choosing a sample size when estimating a proportion

    Choosing a sample size

    How many people should you sample in order to cut the margin of error of a 95% confidence interval down to 1%?

    $ME = z^\star \times SE$

    $0.01 \ge 1.96 \times \sqrt{\frac{0.85 \times 0.15}{n}}$ → Use estimate for p̂ from previous study

    $0.01^2 \ge 1.96^2 \times \frac{0.85 \times 0.15}{n}$

    $n \ge \frac{1.96^2 \times 0.85 \times 0.15}{0.01^2}$

    $n \ge 4898.04$ → n should be at least 4,899
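The algebra above amounts to solving the margin-of-error inequality for n and rounding up. A minimal sketch, with a hypothetical function name, might look like this:

```python
import math


def required_sample_size(p_guess, margin_of_error, z_star=1.96):
    """Smallest n with z_star * sqrt(p_guess(1 - p_guess)/n) <= margin_of_error."""
    n_exact = z_star**2 * p_guess * (1 - p_guess) / margin_of_error**2
    return math.ceil(n_exact)  # always round up: a fractional person cannot be sampled


n_needed = required_sample_size(0.85, 0.01)
print(n_needed)
```

Rounding up rather than to the nearest integer matters here: rounding 4898.04 down would leave the margin of error slightly above the 1% target.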

    Single population proportion Choosing a sample size when estimating a proportion

    What if there isn’t a previous study?

    ... use p̂ = 0.5

    why?

    if you don’t know any better, 50-50 is a good guess

    p̂ = 0.5 gives the most conservative estimate: the highest possible sample size
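To see why 0.5 is the conservative choice, one can compare the sample size implied by several guesses for the proportion. The sketch below assumes the same 95% critical value (1.96) and 1% margin of error as the previous slide; the grid of guesses is arbitrary.

```python
Z_STAR = 1.96   # critical value for 95% confidence
MARGIN = 0.01   # target margin of error


def n_for(p_guess):
    """Unrounded sample size needed for the target margin of error, given a guess for p."""
    return Z_STAR**2 * p_guess * (1 - p_guess) / MARGIN**2


guesses = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
sizes = {g: n_for(g) for g in guesses}
worst_case = max(sizes, key=sizes.get)    # the guess that demands the largest sample

for g in guesses:
    print(f"p = {g:.1f}: n of about {sizes[g]:.0f}")
print("largest requirement at p =", worst_case)
```

Because p(1 − p) peaks at p = 0.5, planning with 0.5 (about 9,604 people in this setting) cannot leave the study short of the target margin, whatever the true proportion turns out to be.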

  • Single population proportion Hypothesis testing for a proportion

    CI vs. HT for proportions

    Success-failure condition:

    CI: At least 10 observed successes and failures

    HT: At least 10 expected successes and failures, calculated using the null value

    Standard error:

    CI: calculate using observed sample proportion: $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

    HT: calculate using the null value: $SE = \sqrt{\frac{p_0(1-p_0)}{n}}$
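The contrast between the two standard errors can be made concrete with the GSS numbers (571 of 670) and the null value 0.80 used in the hypothesis test in this lecture. The sketch below is illustrative only and the variable names are made up.

```python
import math


def se_proportion(p, n):
    """Standard error sqrt(p(1 - p)/n) for whichever proportion p is plugged in."""
    return math.sqrt(p * (1 - p) / n)


n = 670
p_hat = 571 / n   # observed sample proportion (used for the CI)
p_null = 0.80     # null value (used for the HT)

se_ci = se_proportion(p_hat, n)
se_ht = se_proportion(p_null, n)

# Success-failure check for the HT uses EXPECTED counts under the null value.
expected_successes = n * p_null
expected_failures = n * (1 - p_null)

print(f"SE for CI = {se_ci:.4f}, SE for HT = {se_ht:.4f}")
print(f"expected successes = {expected_successes:.0f}, expected failures = {expected_failures:.0f}")
```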

    Single population proportion Hypothesis testing for a proportion

    The GSS found that 571 out of 670 (85%) of Americans answered the question on experimental design correctly. Do these data provide convincing evidence that more than 80% of Americans have a good intuition about experimental design?

    $H_0: p = 0.80 \qquad H_A: p > 0.80$

    $SE = \sqrt{\frac{0.80 \times 0.20}{670}} = 0.0154$

    $Z = \frac{0.85 - 0.80}{0.0154} = 3.25$

    p-value = 1 − 0.9994 = 0.0006

    [Figure: distribution of sample proportions, with 0.8 and 0.85 marked on the axis]

    Since the p-value is low, we reject H0. The data provide convincing evidence that more than 80% of Americans have a good intuition on experimental design.
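The same test can be reproduced without a normal table. The sketch below uses `math.erfc` from the Python standard library to get the upper-tail normal probability; the function name is hypothetical. Because it carries the unrounded standard error through, the test statistic comes out near 3.24 rather than the 3.25 obtained above from the rounded 0.0154, while the p-value still rounds to 0.0006.

```python
import math


def one_sided_z_test(p_hat, p_null, n):
    """Z statistic and upper-tail p-value for H0: p = p_null vs HA: p > p_null."""
    se = math.sqrt(p_null * (1 - p_null) / n)     # SE uses the null value
    z = (p_hat - p_null) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z) for a standard normal
    return z, p_value


z, p_value = one_sided_z_test(0.85, 0.80, 670)
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
```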

    Single population proportion Hypothesis testing for a proportion

    Clicker question

    11% of 1,001 Americans responding to a 2006 Gallup survey stated that they have objections to celebrating Halloween on religious grounds. At 95% confidence level, the margin of error for this survey is ±3%. A news piece on this study’s findings states: “More than 10% of all Americans have objections on religious grounds to celebrating Halloween.” At 95% confidence level, is this news piece’s statement justified?

    (a) Yes

    (b) No

    (c) Cannot tell

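One way to reason about a question like this is to build the interval from the reported point estimate and margin of error, then ask whether every value in it is consistent with the claim. The helper below is a hypothetical sketch of that logic, not an official answer key.

```python
def claim_supported(point_estimate, margin_of_error, threshold):
    """True only if the whole interval (estimate ± ME) lies above the threshold."""
    lower = point_estimate - margin_of_error
    return lower > threshold


# Reported values: 11% point estimate, ±3% margin of error; claim: more than 10%.
supported = claim_supported(0.11, 0.03, 0.10)
print(supported)
```

The interval runs from 8% to 14%, so it includes values at or below 10%, and under this reasoning the data do not rule out a true proportion of 10% or less.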

    Difference of two proportions

    Melting ice cap

    Clicker question

    Scientists predict that global warming may have big effects on the polar regions within the next 100 years. One of the possible effects is that the northern ice cap may completely melt. Would this bother you a great deal, some, a little, or not at all if it actually happened?

    (a) A great deal

    (b) Some

    (c) A little

    (d) Not at all


  • Difference of two proportions

    Results from the GSS

    The GSS asks the same question. Below is the distribution of responses from the 2010 survey:

    Response        Count
    A great deal      454
    Some              124
    A little           52
    Not at all         50
    Total             680

    Difference of two proportions

    Parameter and point estimate

    Parameter of interest: Difference between the proportions of all Duke students and all Americans who would be bothered a great deal by the northern ice cap completely melting.

    $p_{Duke} - p_{US}$

    Point estimate: Difference between the proportions of sampled Duke students and sampled Americans who would be bothered a great deal by the northern ice cap completely melting.

    $\hat{p}_{Duke} - \hat{p}_{US}$

    Difference of two proportions

    Inference for comparing proportions

    The details are the same as before...

    CI: point estimate ± margin of error

    HT: Use $Z = \frac{\text{point estimate} - \text{null value}}{SE}$ to find appropriate p-value.

    We just need the appropriate standard error of the point estimate ($SE_{\hat{p}_{Duke} - \hat{p}_{US}}$), which is the only new concept.

    Standard error of the difference between two sample proportions

    $SE_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
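A confidence interval for a difference of proportions follows the same recipe with this new standard error. In the sketch below, the US counts (454 of 680) come from the GSS table, but the Duke counts (70 of 100) are HYPOTHETICAL placeholders invented purely so the code runs, since the class responses are not recorded in these slides. The function name is also made up.

```python
import math


def diff_ci(successes1, n1, successes2, n2, z_star=1.96):
    """CI for p1 - p2 using the observed sample proportions in the standard error."""
    p1, p2 = successes1 / n1, successes2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se, se


# Group 1: HYPOTHETICAL Duke sample (70 of 100), for illustration only.
# Group 2: US sample from the GSS table (454 of 680).
lower, upper, se = diff_ci(70, 100, 454, 680)
print(f"SE = {se:.4f}, 95% CI for the difference: ({lower:.3f}, {upper:.3f})")
```

Swapping in the real class counts for the placeholder values is the only change needed to use this on the exercise data.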

    Difference of two proportions Confidence intervals for difference of proportions

    Conditions for CI for difference of proportions

    1. Independence

    within groups: The US group is sampled randomly and we’re assuming that the Duke group represents a random sample as well. $n_{Duke}$ < 10% of all Duke students and 680 < 10% of all Americans. We can assume that the attitudes of Duke students in the sample are independent of each other, and attitudes of US residents in the sample are independent of each other as well.

    between groups: The sampled Duke students and the US residents are independent of each other.

    2. Success-failure: At least 10 observed successes and 10 observed failures in the two groups.

  • Difference of two proportions Confidence intervals for difference of proportions

    Application exercise: CI for difference of proportions

    Construct a 95% confidence interval for the difference between the proportions of Duke students and Americans who would be bothered a great deal by the melting of the northern ice cap ($p_{Duke} - p_{US}$).

    Use your clicker to submit your response for the lower bound of your confidence interval, rounded to 2 decimal places.

    Data                Duke     US
    A great deal                454
    Not a great deal            226
    Total                       680

    Difference of two proportions HT for comparing proportions

    Clicker question

    Which of the following is the correct set of hypotheses for testing if the proportion of all Duke students who would be bothered a great deal by the melting of the northern ice cap differs from the proportion of all Americans who do?

    (a) $H_0: p_{Duke} = p_{US}$; $H_A: p_{Duke} \ne p_{US}$

    (b) $H_0: \hat{p}_{Duke} = \hat{p}_{US}$; $H_A: \hat{p}_{Duke} \ne \hat{p}_{US}$

    (c) $H_0: p_{Duke} - p_{US} = 0$; $H_A: p_{Duke} - p_{US} \ne 0$

    (d) $H_0: p_{Duke} = p_{US}$; $H_A: p_{Duke} < p_{US}$

    Difference of two proportions HT for comparing proportions

    Flashback to working with one proportion

    When constructing a confidence interval for a population proportion, we check if the observed number of successes and failures are at least 10.

    $n\hat{p} \ge 10 \qquad n(1 - \hat{p}) \ge 10$

    When conducting a hypothesis test for a population proportion, we check if the expected number of successes and failures are at least 10.

    $np_0 \ge 10 \qquad n(1 - p_0) \ge 10$

    Difference of two proportions HT for comparing proportions

    Pooled estimate of a proportion

    In the case of comparing two proportions where $H_0: p_1 = p_2$, there isn’t a given null value we can use to calculate the expected number of successes and failures in each sample.

    Therefore, we need to first find a common (pooled) proportion for the two groups, and use that in our analysis.

    This simply means finding the proportion of total successes among the total number of observations.

    Pooled estimate of a proportion

    $\hat{p} = \frac{\#\text{ of successes}_1 + \#\text{ of successes}_2}{n_1 + n_2}$
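The pooled proportion and the test built on it can be sketched as follows. As elsewhere in these notes, the US counts (454 of 680) come from the GSS table, while the Duke counts (70 of 100) are HYPOTHETICAL placeholders, because the class responses are not recorded in the slides. The function name is an assumption, and the two-sided p-value is computed with `math.erfc` under the normal model.

```python
import math


def pooled_two_sided_test(successes1, n1, successes2, n2):
    """Pooled proportion, Z statistic, and two-sided p-value for H0: p1 = p2."""
    p1, p2 = successes1 / n1, successes2 / n2
    p_pool = (successes1 + successes2) / (n1 + n2)  # total successes / total observations
    se = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # equals 2 * P(Z > |z|)
    return p_pool, z, p_value


# Group 1: HYPOTHETICAL Duke sample (70 of 100), for illustration only.
# Group 2: US sample from the GSS table (454 of 680).
p_pool, z, p_value = pooled_two_sided_test(70, 100, 454, 680)
print(f"pooled p-hat = {p_pool:.3f}, Z = {z:.2f}, p-value = {p_value:.3f}")
```

Note that the pooled estimate is a weighted average of the two sample proportions, so it falls closer to the proportion from the larger sample.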

  • Difference of two proportions HT for comparing proportions

    Application exercise: Pooled estimate of a proportion - in context

    Calculate the estimated pooled proportion of Duke students and Americans who would be bothered a great deal by the melting of the northern ice cap. Which sample proportion ($\hat{p}_{Duke}$ or $\hat{p}_{US}$) is the pooled estimate closer to? Why?

    Use your clicker to submit a numerical response, rounded to 3 decimal places.

    Data                Duke     US
    A great deal                454
    Not a great deal            226
    Total                       680

    Difference of two proportions HT for comparing proportions

    Application exercise: HT for comparing proportions

    Do these data suggest that the proportion of all Duke students who would be bothered a great deal by the melting of the northern ice cap differs from the proportion of all Americans who do? Calculate the test statistic, the p-value, and interpret your conclusion in context of the data.

    Use your clicker to submit the value of the test statistic you calculate.

    Data    Duke       US
    p̂               0.668
    n                  680

    Recap

    Recap - inference for one proportion

    Population parameter: p, point estimate: p̂

    Conditions:

    independence
    - random sample and 10% condition
    at least 10 successes and failures
    - if not → randomization

    Standard error: $SE = \sqrt{\frac{p(1-p)}{n}}$

    for CI: use p̂
    for HT: use p0

    Recap

    Recap - comparing two proportions

    Population parameter: (p1 − p2), point estimate: (p̂1 − p̂2)

    Conditions:

    independence within groups
    - random sample and 10% condition met for both groups
    independence between groups
    at least 10 successes and failures in each group
    - if not → randomization

    $SE_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$

    for CI: use p̂1 and p̂2
    for HT:
    when H0 : p1 = p2: use $\hat{p}_{pool} = \frac{\#\text{suc}_1 + \#\text{suc}_2}{n_1 + n_2}$
    when H0 : p1 − p2 = (some value other than 0): use p̂1 and p̂2
    - this is pretty rare

  • Recap

    Reference - standard error calculations

                  one sample                           two samples

    mean          $SE = \frac{s}{\sqrt{n}}$            $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

    proportion    $SE = \sqrt{\frac{p(1-p)}{n}}$       $SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$

    When working with means, it’s very rare that σ is known, so we usually use s.

    When working with proportions,

    if doing a hypothesis test, p comes from the null hypothesis
    if constructing a confidence interval, use p̂ instead
