TRANSCRIPT
1
Hypothesis testing, part 2
With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal
2
CATEGORICAL IV, NUMERIC DV
3
Independent samples, one IV
# Conditions   Normal/Parametric   Non-parametric
Exactly 2      T-test              Mann-Whitney U, bootstrap
2+             One-way ANOVA       Kruskal-Wallis, bootstrap
4
Is your data normal?
• Skewness: asymmetry
• Kurtosis: “peakedness” rel. to normal
– Both: within ±2 SE(s/u) is OK
• Or use Shapiro-Wilk (null = normal)
• Or look at Q-Q plot
5
T-test
• Already talked about
• Assumptions: normality, equal variances, independent samples
– Can use Levene to test equal variance assumption
• Post-test: check residuals for assumption fit
– For a t-test this is the same pre or post
– For other tests you check residual vs. fit post
6
One way ANOVA
• H0: m1 = m2 = m3
• H1: at least one doesn’t match
• NOT H1: m1 != m2 != m3
• Assumptions: normality, common variance, independent errors
• Intuition: F statistic
– Variance between / Variance within
– Under the (exact) null, F is about 1; F >> 1 rejects the null
7
One-way ANOVA
• F = MSb / MSw
• MSw = sum [ sum [ (diff from condition mean)^2 ] ] / dfw
– dfw = N-k, where k = number of conditions
– Sum over all conditions; sum per condition
• MSb = sum [ (condition mean – grand mean)^2 ] / dfb
– dfb = k-1
– Every observation goes in the sum (each contributes its own condition's term)
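A minimal sketch of the F computation from this slide, in plain Python. The three groups are made-up illustration data (not the example from the following slides), and `one_way_f` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical scores for k = 3 conditions (illustration only).
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [6.0, 7.0, 8.0, 7.0],
    [9.0, 8.0, 10.0, 9.0],
]

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Within: squared distance of each observation from its own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Between: squared distance of the condition mean from the grand mean,
    # counted once per observation in that condition.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

f_stat, dfb, dfw = one_way_f(groups)
print(f"F({dfb}, {dfw}) = {f_stat:.2f}")
```

Here the condition means (5, 7, 9) are far apart relative to the spread inside each condition, so F comes out much larger than 1.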
8
(example from Vibha Sazawal)
9
10
[Figure: F-distribution, with a region labeled "rejected"]
11
Now what? (Contrasts)
• So we rejected the null. What did we learn?
– What *didn’t* we learn?
– At least one is different ... Which? All?
– This is called an “omnibus test”
• To answer our actual research question, we usually need pairwise contrasts
12
The trouble with contrasts
• Contrasts mess with your Type I bounds
– One test: 95% confident
– Three tests: 85.7% confident
– 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9%
– UH OH
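The percentages on this slide follow from multiplying per-test confidence. A small sketch, assuming the tests are independent and each uses alpha = 0.05 (`familywise_confidence` is a hypothetical helper name):

```python
from math import comb

ALPHA = 0.05

def familywise_confidence(num_tests, alpha=ALPHA):
    """Chance of making no Type I error across independent tests."""
    return (1 - alpha) ** num_tests

# 1 test, 3 tests, and all pairs among 5 conditions (10 tests).
for tests in (1, 3, comb(5, 2)):
    print(tests, f"{familywise_confidence(tests):.1%}")
```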
13
Planned vs. post hoc
• Planned: You have a theory.
– Really, no cheating
– You get n-1 pairwise comparisons for free (n = number of conditions)
– In theory, should not be control vs. all, but prob. OK
– NO COMPARISONS unless omnibus passes
• Post-hoc
– Anything unplanned
– More than n-1
– Requires correction!
– Doesn’t necessarily require omnibus first
14
Correction
• Adjust {p-values, alpha} to compensate for multiple testing post-hoc
• Bonferroni (most conservative)
– Assume all possible pairs: m = k(k-1)/2 (comb.)
– alphac = alpha / m
– Once you have looked, implication is you did all the comparisons implicitly!
• Holm-Bonferroni is less conservative
– Stepwise adjusting alpha as you go
• Dunnett for specifically all vs. control, others
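To make the difference between the two corrections concrete, here is a hedged sketch in plain Python. The p-values are invented, and both functions adjust alpha (not the p-values), which is one of the two options the slide mentions.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure; returns reject flags in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Smallest p is compared to alpha/m, the next to alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_vals = [0.010, 0.020, 0.040]  # hypothetical p-values from 3 contrasts
print(bonferroni(p_vals))
print(holm_bonferroni(p_vals))
```

With these invented p-values, Bonferroni rejects only the first null, while Holm rejects all three because the threshold relaxes at each step.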
15
Independent samples, one IV
# Conditions   Normal/Parametric   Non-parametric
Exactly 2      T-test              Mann-Whitney U, bootstrap
2+             One-way ANOVA       Kruskal-Wallis, bootstrap
16
Non-parametrics: MWU and K-W
• Good for non-normal data, Likert data (ordinal, not actually numeric)
• Assumptions: independent, at least ordinal
• Null: P(X > Y) = P(Y > X) where X,Y are observations from the 2 distributions (MWU)
– If you assume the same distribution shape and continuous data, then this can be seen as comparing medians
17
MWU and K-W continued
• Essentially: rank order all data (both conditions)
– Total ranks for condition 1, compare to “expected”
– Various procedures to correct for ties
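The ranking idea can be sketched in a few lines. This is an illustrative computation of the U statistic only (no p-value), with ties handled by average ranks, which is one common convention; the Likert responses are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, expected U under the null)."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b / 2

likert_a = [1, 2, 2, 3, 3]  # hypothetical responses, condition 1
likert_b = [3, 4, 4, 5, 5]  # hypothetical responses, condition 2
u_a, u_expected = mann_whitney_u(likert_a, likert_b)
print(u_a, u_expected)
```

A U far from its expected value (here far below it) is evidence against the null; turning that into a p-value needs the exact or normal-approximation distribution of U, which this sketch leaves out.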
18
Bootstrap
• Resampling technique(s)
• Intuition:
– Create “null” distribution by e.g. subtracting means so mA = mB = 0
   • Now you have shifted samples A-hat and B-hat
– Combine these to make a null distribution
– Draw sample of size N, with replacement
   • Do it 1000 (or 10k) times
– Use this to determine critical value (alpha = 0.05)
– Compare this critical value to your real data for test
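The steps above can be sketched as follows. This is one reading of the slide, not a definitive implementation: the test statistic is the absolute difference in means, each resample redraws both groups at their original sizes from the pooled shifted data, and the sample values are invented.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def bootstrap_mean_test(sample_a, sample_b, reps=2000, alpha=0.05, seed=0):
    """Bootstrap test of H0: equal means. Returns (observed, critical, reject)."""
    rng = random.Random(seed)
    mean_a, mean_b = _mean(sample_a), _mean(sample_b)
    observed = abs(mean_a - mean_b)

    # Shift each sample to mean 0, then pool to form the "null" population.
    null_pool = [x - mean_a for x in sample_a] + [x - mean_b for x in sample_b]

    null_diffs = []
    for _ in range(reps):
        # Resample both groups with replacement from the null population.
        fake_a = rng.choices(null_pool, k=len(sample_a))
        fake_b = rng.choices(null_pool, k=len(sample_b))
        null_diffs.append(abs(_mean(fake_a) - _mean(fake_b)))

    # Critical value: the (1 - alpha) quantile of the simulated null statistics.
    null_diffs.sort()
    cutoff_index = min(reps - 1, round((1 - alpha) * reps))
    critical = null_diffs[cutoff_index]
    return observed, critical, observed > critical

group_a = [10, 11, 12, 13, 14, 15]  # hypothetical measurements
group_b = [20, 21, 22, 23, 24, 25]
observed, critical, reject = bootstrap_mean_test(group_a, group_b)
print(observed, critical, reject)
```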
19
Paired samples, one IV
# Conditions   Normal/Parametric                         Non-parametric
Exactly 2      Paired T-test                             Wilcoxon signed-rank
2+             2-way ANOVA w/ subject random factor;     Friedman
               mixed models (later)
20
Paired T-test
• Two samples per participant/item
• Test subtracts them
• Then uses one-sample T-test with H0: m = 0 and H1: m != 0
• Regular T-test assumptions, plus: does subtraction make sense here?
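A short sketch of the subtraction step and the resulting one-sample t statistic. The before/after scores are invented, and only the statistic and its degrees of freedom are computed (looking up the p-value is left out).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores for the same 5 participants under two conditions.
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after = [14.0, 18.0, 12.0, 17.0, 15.0]

# Step 1: subtract within each participant.
diffs = [a - b for a, b in zip(after, before)]

# Step 2: one-sample t-test of H0: mean difference = 0.
n = len(diffs)
standard_error = stdev(diffs) / sqrt(n)
t_stat = mean(diffs) / standard_error
df = n - 1

print(f"t({df}) = {t_stat:.3f}")
```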
21
Wilcoxon S.R. / Friedman
• H0: difference btwn pairs is symmetric around 0
• H1: … or not
• Excludes no-change items
• Essentially: rank by abs. difference; compare signs * ranks
• (Friedman = 3+ generalization)
22
SIMPLE LINEAR REGRESSION
One numeric IV, numeric DV
23
Simple linear regression
• E(Y|x) = b0 + b1x … looks at populations
– Population mean at this value of x
• Key H0: b1 = 0 (H1: b1 != 0)
– b0 usually not important for significance (obv. important in model fit)
• b1: slope -> change in Y per unit X
• Best fit: Least squares, or maximum likelihood
– LSq: minimize sum of squares of residuals
– ML: max prob. of seeing this data with this model
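For one predictor, the least-squares estimates have a closed form. A minimal sketch with invented data (`fit_line` is a hypothetical helper name):

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for E(Y|x) = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope: change in Y per unit X
    b0 = y_bar - b1 * x_bar   # intercept: line passes through (x_bar, y_bar)
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor values
ys = [3.0, 5.0, 4.0, 6.0, 7.0]   # hypothetical outcomes
b0, b1 = fit_line(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```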
24
Assumptions, caveats
• Assumes:
– linearity in Y ~ X
– normally distributed error for each x, with constant variance at all x
– Error measuring X is small compared to var. Y (fixed X)
• Independent errors!
– Serial correlation, data that is grouped, etc. (later)
• Don’t interpret widely outside available x vals
• Can transform for linearity!
– Log(Y), sqrt(y), 1/y, y^2
25
Assumption/residual checking
• Before: Use scatterplot for plausible linearity
• After: residual vs. fit
– Residual on Y vs. predicted on X
– Should be relatively evenly distributed around 0 (linear)
– Should have relatively even vertical spread (eq. var)
• After: quantile-normal of residuals
26
Model interpretation
• Interpret b1, interpret the p-value
• CI: if it crosses 0, it’s not significant
• R^2: fraction of total variation accounted for
– Intuitively: explained variance / total variance
– Explained = var(Y) – residual errors
• f^2 = R^2 / (1 – R^2); small/medium/large: 0.02, 0.15, 0.35 (Cohen)
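A sketch of how R^2 and Cohen's f^2 relate, in plain Python. The outcomes are invented, the fitted values come from the least-squares line y = 2.3 + 0.9x on x = 1..5, and the "negligible" label below 0.02 is an assumption of this example, not part of Cohen's thresholds as quoted on the slide.

```python
def r_squared(ys, fitted):
    """Fraction of total variation in ys accounted for by the fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - ss_resid / ss_total

def cohen_f2(r2):
    """Effect size f^2 = R^2 / (1 - R^2)."""
    return r2 / (1 - r2)

def label_effect(f2):
    """Map f^2 to Cohen's small / medium / large thresholds."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"  # label chosen for this sketch

ys = [3.0, 5.0, 4.0, 6.0, 7.0]        # hypothetical outcomes
fitted = [3.2, 4.1, 5.0, 5.9, 6.8]    # from the line y = 2.3 + 0.9x
r2 = r_squared(ys, fitted)
f2 = cohen_f2(r2)
print(f"R^2 = {r2:.2f}, f^2 = {f2:.2f} ({label_effect(f2)})")
```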
27
Robustness
• Brittle to linearity, independent errors
• Somewhat brittle to fixed-X
• Fairly robust to equal variance
• Quite robust to normality
28
CATEGORICAL OUTCOMES
29
One Cat. IV, Cat. DV, independent
• Contingency tables: how many people in each combination of categories
30
Chi-square test of independence
• H0: distribution of Var1 is the same at every level of Var2 (and vice versa)
– Null dist. approaches X^2 when sample size grows
– Heuristic: no cells < 5
– Can use FET instead
• Intuition:
– Sum over rows/columns: (observed – expected)^2 / expected
– Expected: marginal % * count in other margin
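The intuition can be written out directly. A minimal sketch with an invented 2x2 contingency table; it also reports the smallest expected count, on the assumption that the "no cells < 5" heuristic is applied to expected counts (a common reading).

```python
def chi_square_independence(table):
    """Return (statistic, df, smallest expected count) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    smallest_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            smallest_expected = min(smallest_expected, expected)
            statistic += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, smallest_expected

# Hypothetical counts: rows = condition, columns = outcome category.
counts = [[30, 10],
          [20, 40]]
chi2, df, min_expected = chi_square_independence(counts)
print(f"X^2({df}) = {chi2:.2f}, smallest expected count = {min_expected}")
```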
31
Paired 2x2 tables
• Use McNemar’s test
– Contingency table: matches and mismatches for each option.
• H0: marginals are the same
• Essentially a X^2 test on the agreement
– Test stat: (b-c)^2 / (b+c)
             Cond1: Yes   Cond1: No
Cond2: Yes   a            b            a + b
Cond2: No    c            d            c + d
             a + c        b + d        N
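A sketch of the McNemar statistic from the table above, using only the discordant cells b and c. The counts are invented; the p-value uses the large-sample chi-square approximation with 1 degree of freedom (no continuity correction), for which the tail probability equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt

def mcnemar(b, c):
    """Return (statistic, approximate p-value) from the two mismatch cells."""
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: b = yes in Cond2 only, c = yes in Cond1 only.
stat, p = mcnemar(b=15, c=5)
print(f"X^2(1) = {stat:.2f}, p = {p:.4f}")
```

The matching cells a and d do not enter the statistic: only participants whose answer changed between conditions carry information about whether the marginals differ.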
32
Paired, continued
• Cochran’s Q: extended for more than two conditions
• Other similar extensions for related tasks
33
Critiques
• Choose a paper that has one (or more) empirical experiments as a central contribution
– Doesn’t have to be human subjects, but can be
– Does have to have enough description of experiment
• 10-12 minute presentation
• Briefly: research questions, necessary background
• Main: describe and critique methods
– Experimental design, data collection, analysis
– Good, bad, ugly, missing
• Briefly, results?
34
Logistic regression (logit)
• Numeric IV, binary DV (or ordinal)
• log( E(Y)/ (1-E(Y)) ) == log ( Pr (Y=1) / Pr (Y=0)) = b0 + b1x
• Log odds of success = linear function
– Odds: 0 to inf., 1 is the middle
– e.g.: odds = 5 = 5:1 … for five successes, one fail
– Log odds: -inf to inf w/ 0 in the middle: good for regression
• Modeled as binomial distribution
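To make the probability / odds / log-odds scales concrete, here is a small sketch in plain Python (the helper names are hypothetical):

```python
from math import exp, log

def odds(p):
    """Odds of success for probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """The logit: maps (0, 1) onto (-inf, inf), with p = 0.5 at 0."""
    return log(odds(p))

def inverse_logit(z):
    """Map a linear predictor b0 + b1*x back to a probability."""
    return 1 / (1 + exp(-z))

# Five successes for every one failure: p = 5/6, odds = 5.
print(round(odds(5 / 6), 6))
print(log_odds(0.5))
print(round(inverse_logit(log_odds(0.8)), 6))
```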
35
Interpreting logistic regression
• Take exp(coef) to get interpretable odds.
• For each unit increase in x, odds are multiplied by exp(b1)
– Note that this can make small coefs important!
• Use e.g., Hosmer-Lemeshow test for goodness of fit – null == data fit the model
– But not a lot of power!
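A quick illustration of why a small coefficient can still matter: the effect on the odds is multiplicative, so it compounds over several units of x. The coefficient value is invented.

```python
from math import exp

def odds_multiplier(coef, delta_x=1.0):
    """Factor by which the odds change when x increases by delta_x."""
    return exp(coef * delta_x)

b1 = 0.2  # hypothetical fitted coefficient (log-odds per unit of x)
print(round(odds_multiplier(b1), 3))      # one-unit increase
print(round(odds_multiplier(b1, 10), 3))  # ten-unit increase
```

With this made-up coefficient, one unit of x raises the odds by about 22%, but ten units multiply them by more than 7.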
36
MULTIVARIATE
37
Multiple regression
• Linear/logistic regression with more variables!
– At least one numeric, 0+ categorical
• Still: fixed x, normal errors w/ equal variance, independent errors (linear)
• Linear relationship in E(Y) and one x, when other inputs held constant
– Effects of each x are independent!
• Still check q-n of residuals, residual vs. fit
38
Model selection
• Which covariates to keep? (more on this in a bit)
39
Adding categorical vars
• Indicator variables (everything is 0 or 1)
• Need one fewer indicator than conditions
– One condition is true; or none are true (baseline)
– Coefs are *relative to baseline*!
• Model selection: keep all or none for one factor
• Called “ANCOVA” when at least one each numeric + categorical
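Indicator coding can be sketched in a few lines. The condition names and the `indicators` helper are hypothetical; the point is that three conditions need only two 0/1 columns, with the baseline encoded as all zeros.

```python
LEVELS = ["control", "tooltip", "tutorial"]  # hypothetical conditions
BASELINE = "control"

def indicators(condition, levels=LEVELS, baseline=BASELINE):
    """Return one 0/1 indicator per non-baseline level."""
    if condition not in levels:
        raise ValueError(f"unknown condition: {condition}")
    return {f"is_{level}": int(condition == level)
            for level in levels if level != baseline}

for condition in LEVELS:
    print(condition, indicators(condition))
```

In a regression on these columns, the coefficient on `is_tutorial` would estimate the difference between the tutorial condition and the control baseline, not the tutorial mean itself.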
40
Interaction
• What if your covariates *aren’t* independent?
• E(Y) = b0 + b1x1 + b2x2 + b12x1x2
– Slope for x1 is diff. for each value of x2
• Superadditive: all in same direction, interaction makes effects stronger
• Subadditive: interaction is in opposite direction
• For indicator vars, all or none
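A small sketch of what the interaction term does to the slope. All coefficient values are invented; with x2 as a 0/1 indicator, the slope for x1 is b1 in the baseline group and b1 + b12 in the other.

```python
def expected_y(b0, b1, b2, b12, x1, x2):
    """E(Y) = b0 + b1*x1 + b2*x2 + b12*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def slope_for_x1(b1, b12, x2):
    """Change in E(Y) per unit of x1, at a fixed value of x2."""
    return b1 + b12 * x2

coefs = dict(b0=1.0, b1=2.0, b2=0.5, b12=1.5)  # hypothetical estimates

for x2 in (0, 1):
    slope = slope_for_x1(coefs["b1"], coefs["b12"], x2)
    print(f"x2 = {x2}: slope for x1 = {slope}")
```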
41
Model selection!
• Which covariates to keep?
• From theory
• Keep interaction only if it’s significant?
– If keep interaction, should keep corresponding mains
• ”Adjusted” R^2?
– Regular R^2 always higher w/ more covars
• BIC and AIC
– Take model likelihood and penalize for more params
– Abs value not interpretable; lower is better
• All combinations? Stepwise?
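The penalty idea behind AIC and BIC, using their standard definitions (AIC = 2k – 2 ln L, BIC = k ln n – 2 ln L). The log-likelihoods and parameter counts below are invented for two hypothetical models fit to the same 100 observations.

```python
from math import log

def aic(log_likelihood, num_params):
    """Akaike information criterion; lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion; penalty grows with sample size."""
    return num_params * log(num_obs) - 2 * log_likelihood

N_OBS = 100
small = {"loglik": -120.0, "params": 3}  # hypothetical simpler model
large = {"loglik": -118.5, "params": 5}  # hypothetical model with 2 extra covariates

for name, model in (("small", small), ("large", large)):
    print(name,
          aic(model["loglik"], model["params"]),
          round(bic(model["loglik"], model["params"], N_OBS), 2))
```

In this made-up case the larger model fits slightly better, but not by enough to pay for its two extra parameters, so both criteria prefer the smaller model.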
42
THINGS WE ARE ONLY GOING TO MENTION BRIEFLY
Know they exist; look them up if relevant
43
Multi-way ANOVA
• >1 cat IVs, 1 numeric DV
• Normality, equal variance, indep. errors
• With interaction: every combo of factor levels has its own population mean
• Without interaction (additive): change in one var is consistent at all fixed vals for others
• Works basically like standard ANOVA, etc.
44
Mixed models regression
• Explicitly model correlations in data
• Fixed effects: affect outcome for everyone
• Random effects: deviations per data item, don’t want to model individually
• Simplest example: repeated measures
– Y ~ b0 + b1x1 + b2x2 …. + random ID intercept
– Each participant has their own intercept adjustment
45
POWER ANALYSIS
46
What is power?
• Null distribution: designed so that we’d only see a test statistic this extreme 5% of the time
• This bounds type I but not type II
• Power = 1 – type II error rate
• Heuristic: 80% is “good enough”
47
Alternative scenarios
• One null, but infinitely many alternatives!
• Alternative distribution: given some n, underlying variance, underlying diff. in pop. means, what is the distribution of test statistic
• You know the critical value, so this tells you how often your p will be above 0.05 (a Type II error) when the “true” scenario is as you model; power is the rest
48
Calculating power
• A priori, to think about sample size and judge value of experiment
• Inherently requires estimating the alternative scenario!
– Maybe try a few
• Statistic-specific, but in general:
– Sample size, effect size, power, alpha
• “Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size”
49
Example from Seltman book
• F statistic (ANOVA)
• 3 treatments
• 50 people each
• Red: sigma = 10, means: 10, 12, 14
• Blue: sigma = 10, means: 10, 13, 16
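The two scenarios on this slide can be compared by simulation. This is a rough Monte Carlo sketch, not the calculation from the Seltman book: it assumes normal data, uses 1000 simulated experiments per scenario, and relies on a closed-form F critical value that is only valid when the numerator df is 2 (exactly three conditions). All function names are hypothetical.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic (MS between / MS within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def f_critical_three_groups(df_within, alpha=0.05):
    """Critical F for numerator df = 2 only.

    For numerator df = 2, P(F > f) = (1 + 2f/df_within) ** (-df_within/2),
    which can be solved for f directly.
    """
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)

def simulated_power(means, sigma, n_per_group, sims=1000, alpha=0.05, seed=1):
    """Fraction of simulated experiments in which the omnibus F test rejects."""
    if len(means) != 3:
        raise ValueError("this sketch only supports exactly 3 conditions")
    rng = random.Random(seed)
    critical = f_critical_three_groups(3 * n_per_group - 3, alpha)
    rejections = 0
    for _ in range(sims):
        groups = [[rng.gauss(mu, sigma) for _ in range(n_per_group)]
                  for mu in means]
        if f_statistic(groups) > critical:
            rejections += 1
    return rejections / sims

power_red = simulated_power([10, 12, 14], sigma=10, n_per_group=50)
power_blue = simulated_power([10, 13, 16], sigma=10, n_per_group=50)
print(f"red: {power_red:.2f}, blue: {power_blue:.2f}")
```

The blue scenario has the same noise and sample size but means that are further apart, so the simulated power comes out higher for blue than for red.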
50
Promoting power
• (Review from earlier)
• Raise sample size; reduce variance; aim for bigger effects
51
Walkthrough: linear regression
• u = model df -> number of params
• v = F-test error df -> N – u – 1
• f^2 = R^2 / (1 – R^2) … R^2 = f^2 / (1 + f^2)
52
Retrospective power
• Somewhat controversial
• Calculate observed effect size, then determine what sample size would be needed
– Whole new experiment, not just collect more
• Not a good idea:
– We didn’t find a significant effect, but if we had studied 12 more people …