
1

Hypothesis testing, part 2

With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

2

CATEGORICAL IV, NUMERIC DV

3

Independent samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2 | T-test | Mann-Whitney U, bootstrap
2+ | One-way ANOVA | Kruskal-Wallis, bootstrap

4

Is your data normal?

• Skewness: asymmetry

• Kurtosis: “peakedness” rel. to normal

– Both: within +- 2SE(s/u) is OK

• Or use Shapiro-Wilk (null = normal)

• Or look at Q-Q plot

5

T-test

• Already talked about

• Assumptions: normality, equal variances, independent samples

– Can use Levene to test equal variance assumption

• Post-test: check residuals for assumption fit

– For a t-test this is the same pre or post

– For other tests you check residual vs. fit post

6

One way ANOVA

• H0: m1 = m2 = m3

• H1: at least one doesn’t match

• NOT H1: m1 != m2 != m3

• Assumptions: normality, common variance, independent errors

• Intuition: F statistic

– Variance between / Variance within

– Under the null, F is expected to be close to 1; F >> 1 rejects the null

7

One-way ANOVA

• F = MSb / MSw

• MSw = sum [sum[ (diff from condition mean)^2 ]] / dfw

– dfw = N-k, where k = number of conditions

– Sum over all conditions; sum per condition

• MSb = sum [(condition mean – grand mean)^2] / dfb

– dfb = k-1

– Every observation goes in the sum (via its condition mean)
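The F computation above can be sketched in a few lines of plain Python. This is only an illustration: the function name `one_way_anova_f` and the toy data in the usage note are assumptions, not part of the original slides.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per condition."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Between: each observation contributes its condition mean's squared
    # distance from the grand mean, hence the len(g) weight.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: squared distance of each observation from its own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

For example, `one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])` gives MSb = 3 and MSw = 1. Turning F into a p-value needs the F-distribution, which is left to a statistics package.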

8

(example from Vibha Sazawal)

10

F-distribution

[Figure: F-distribution plot; label: rejected]

11

Now what? (Contrasts)

• So we rejected the null. What did we learn?

– What *didn’t* we learn?

– At least one is different ... Which? All?

– This is called an “omnibus test”

• To answer our actual research question, we usually need pairwise contrasts

12

The trouble with contrasts

• Contrasts mess with your Type I bounds

– One test: 95% confident

– Three tests: 85.7% confident

– 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9%

– UH OH
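The percentages above follow from multiplying the per-test confidence across tests. A minimal sketch, assuming the tests are independent and each uses alpha = 0.05 (the function names are illustrative):

```python
from math import comb

def familywise_confidence(n_tests, alpha=0.05):
    """Chance of making no Type I error across n_tests independent tests."""
    return (1 - alpha) ** n_tests

def all_pairs(k):
    """Number of pairwise contrasts among k conditions."""
    return comb(k, 2)

for m in (1, 3, all_pairs(5)):
    print(m, "tests:", round(100 * familywise_confidence(m), 1), "% confident")
```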

13

Planned vs. post hoc

• Planned: You have a theory.

– Really, no cheating

– You get n-1 pairwise comparisons for free (n = number of conditions)

– In theory, should not be control vs. all, but prob. OK

– NO COMPARISONS unless omnibus passes

• Post-hoc

– Anything unplanned

– More than n-1

– Requires correction!

– Doesn’t necessarily require omnibus first

14

Correction

• Adjust {p-values, alpha} to compensate for multiple testing post-hoc

• Bonferroni (most conservative)

– Assume all possible pairs: m = k(k-1)/2 (comb.)

– alpha_c = alpha / m

– Once you have looked, implication is you did all the comparisons implicitly!

• Holm-Bonferroni is less conservative

– Stepwise adjusting alpha as you go

• Dunnett for specifically all vs. control, others
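To make the difference between the two corrections concrete, here is a small sketch of Bonferroni and Holm-Bonferroni decisions on a list of p-values. The function names and the example p-values are made up for illustration; here m is simply the number of p-values passed in.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p = [0.01, 0.02, 0.04]  # hypothetical p-values from three contrasts
print(bonferroni_reject(p))
print(holm_reject(p))
```

With these three p-values Bonferroni rejects only the first null, while Holm rejects all three, which is what "less conservative" means in practice.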

15

Independent samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2 | T-test | Mann-Whitney U, bootstrap
2+ | One-way ANOVA | Kruskal-Wallis, bootstrap

16

Non-parametrics: MWU and K-W

• Good for non-normal data, likert data (ordinal, not actually numeric)

• Assumptions: independent, at least ordinal

• Null: P(X > Y) = P(Y > X) where X,Y are observations from the 2 distributions (MWU)

– If you assume the same distribution shape and continuous data, then this can be seen as comparing medians

17

MWU and K-W continued

• Essentially: rank order all data (both conditions)

– Total ranks for condition 1, compare to “expected”

– Various procedures to correct for ties
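The ranking idea can be sketched as follows. This computes only the U statistics, using average ranks for ties (one of several tie-handling conventions), and leaves the p-value to a statistics package. The names `midranks` and `mann_whitney_u` are illustrative.

```python
def midranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U1, U2); under the null each is expected to be len(x)*len(y)/2."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u1 = rank_sum_x - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return u1, u2
```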

18

Bootstrap

• Resampling technique(s)

• Intuition:

– Create “null” distribution by e.g. subtracting means so mA = mB = 0

• Now you have shifted samples A-hat and B-hat

– Combine these to make a null distribution

– Draw sample of size N, with replacement

• Do it 1000 (or 10k) times

– Use this to determine critical value (alpha = 0.05)

– Compare this critical value to your real data for test
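One possible reading of the procedure above, sketched in plain Python. The choice of test statistic (absolute difference in means), the way each resample is split back into two pseudo-groups, and the function name are all assumptions made for illustration; other bootstrap variants exist.

```python
import random
from statistics import mean

def bootstrap_mean_diff_test(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (observed, critical, reject) for the absolute difference in means."""
    rng = random.Random(seed)
    mean_a, mean_b = mean(a), mean(b)
    observed = abs(mean_a - mean_b)
    # Shift each sample so its mean is 0, then pool them into a "null" population.
    pool = [x - mean_a for x in a] + [x - mean_b for x in b]
    null_diffs = []
    for _ in range(n_resamples):
        resample = rng.choices(pool, k=len(pool))  # sampling with replacement
        pseudo_a, pseudo_b = resample[:len(a)], resample[len(a):]
        null_diffs.append(abs(mean(pseudo_a) - mean(pseudo_b)))
    null_diffs.sort()
    # Critical value: roughly the (1 - alpha) quantile of the null differences.
    index = min(n_resamples - 1, int((1 - alpha) * n_resamples))
    critical = null_diffs[index]
    return observed, critical, observed > critical
```

The fixed `seed` only makes the sketch reproducible; it is not part of the method.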

19

Paired samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2 | Paired T-test | Wilcoxon signed-rank
2+ | 2-way ANOVA w/ subject random factor; Mixed models (later) | Friedman

20

Paired T-test

• Two samples per participant / item

• Test subtracts them

• Then uses one-sample T-test with H0: m = 0 and H1: m != 0

• Regular T-test assumptions, plus: does subtraction make sense here?
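The subtract-then-one-sample-test recipe above looks like this in code. A minimal sketch with an illustrative function name; it returns the t statistic and degrees of freedom, and the p-value lookup is left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(first, second):
    """One-sample t statistic on the per-participant differences (H0: m = 0)."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1
```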

21

Wilcoxon S.R. / Friedman

• H0: difference btwn pairs is symmetric around 0

• H1: … or not

• Excludes no-change items

• Essentially: rank by abs. difference; compare signs * ranks

• (Friedman = 3+ generalization)

22

SIMPLE LINEAR REGRESSION

One numeric IV, numeric DV

23

Simple linear regression

• E(Y|x) = b0 + b1x … looks at populations

– Population mean at this value of x

• Key H0: b1 = 0 (H1: b1 != 0)

– b0 usually not important for significance (obv. important in model fit)

• b1 : slope -> change in Y per unit X

• Best fit: Least squares, or maximum likelihood

– LSq: minimize sum of squares of residuals

– ML: max prob. of seeing this data with this model
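For a single predictor, the least-squares estimates have a closed form: b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). A small sketch, with an illustrative function name:

```python
from statistics import mean

def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y ~ b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1
```

For instance, `least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])` recovers an intercept of 1 and a slope of 2, since the toy points lie exactly on that line.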

24

Assumptions, caveats

• Assumes:

– linearity in Y ~ X

– normally distributed error for each x, with constant variance at all x

– Error measuring X is small compared to var. Y (fixed X)

• Independent errors!

– Serial correlation, data that is grouped, etc. (later)

• Don’t interpret widely outside available x vals

• Can transform for linearity!

– Log(Y), sqrt(y), 1/y, y^2

25

Assumption/residual checking

• Before: Use scatterplot for plausible linearity

• After: residual vs. fit

– Residual on Y vs. predicted on X

– Should be relatively evenly distributed around 0 (linear)

– Should have relatively even vertical spread (eq. var)

• After: quantile-normal of residuals

26

Model interpretation

• Interpret b1, interpret the p-value

• CI: if it crosses 0, it’s not significant

• R^2: fraction of total variation accounted for

– Intuitively: explained variance / total variance

– Explained = var(Y) – residual errors

• f^2 = R^2 / (1 – R^2); small / medium / large: 0.02, 0.15, 0.35 (Cohen)
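R^2 and Cohen's effect size can be computed directly from observed and fitted values. A sketch, with illustrative function names and made-up numbers in the usage note:

```python
from statistics import mean

def r_squared(y, y_hat):
    """Fraction of total variation in y accounted for by the fitted values."""
    y_bar = mean(y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

def cohens_f2(r2):
    """Cohen's effect size: explained variation relative to unexplained."""
    return r2 / (1 - r2)
```

With `y = [1, 2, 3, 4]` and fitted values `[1.5, 1.5, 3.5, 3.5]`, the residual sum of squares is 1 and the total is 5, so R^2 is 0.8.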

27

Robustness

• Brittle to linearity, independent errors

• Somewhat brittle to fixed-X

• Fairly robust to equal variance

• Quite robust to normality

28

CATEGORICAL OUTCOMES

29

One Cat. IV, Cat. DV, independent

• Contingency tables: how many people in each combination of categories

30

Chi-square test of independence

• H0: distribution of Var1 is the same at every level of Var2 (and vice versa)

– Null dist. approaches X^2 as sample size grows

– Heuristic: no cells with expected count < 5

– Can use Fisher’s exact test (FET) instead

• Intuition:

– Sum over rows/columns: (observed – expected)^2 / expected

– Expected: marginal % * count in other margin
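The sum above can be written out for a table of any size. A minimal sketch: the function name and the 2x2 counts in the usage note are made up, and the p-value lookup against the X^2 distribution is left to a statistics package.

```python
def chi_square_independence(table):
    """Return (statistic, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count: row share of the total times the column total.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df
```

For the table `[[10, 20], [20, 10]]` every expected count is 15, so each cell contributes 25/15 to the statistic.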

31

Paired 2x2 tables

• Use McNemar’s test

– Contingency table: matches and mismatches for each option.

• H0: marginals are the same

• Essentially a X^2 test on the agreement

– Test stat: (b-c)^2 / (b+c)

          | Cond1: Yes | Cond1: No |
Cond2: Yes | a | b | a + b
Cond2: No | c | d | c + d
          | a + c | b + d | N
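Only the two mismatch cells, b and c, enter the test statistic. A sketch without continuity correction (an assumption; some references add one), using the fact that the upper tail of a X^2 distribution with 1 degree of freedom can be written with the complementary error function:

```python
from math import erfc, sqrt

def mcnemar(b, c):
    """Return (statistic, p_value) from the two discordant cell counts."""
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value
```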

32

Paired, continued

• Cochran’s Q: extended for more than two conditions

• Other similar extensions for related tasks

33

Critiques

• Choose a paper that has one (or more) empirical experiments as a central contribution

– Doesn’t have to be human subjects, but can be

– Does have to have enough description of experiment

• 10-12 minute presentation

• Briefly: research questions, necessary background

• Main: describe and critique methods

– Experimental design, data collection, analysis

– Good, bad, ugly, missing

• Briefly, results?

34

Logistic regression (logit)

• Numeric IV, binary DV (or ordinal)

• log( E(Y)/ (1-E(Y)) ) == log ( Pr (Y=1) / Pr (Y=0)) = b0 + b1x

• Log odds of success = linear function

– Odds: 0 to inf., 1 is the middle

– e.g.: odds = 5 = 5:1 … for five successes, one fail

– Log odds: -inf to inf w/ 0 in the middle: good for regression

• Modeled as binomial distribution
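The conversions between probability, odds, and log odds are short enough to write out. Function names here are illustrative:

```python
from math import exp, log

def odds(p):
    """Odds of success for probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Logit: maps (0, 1) onto the whole real line, with 0 at p = 0.5."""
    return log(odds(p))

def inverse_logit(z):
    """Map a linear predictor b0 + b1*x back to a probability."""
    return 1 / (1 + exp(-z))
```

A probability of 5/6 corresponds to the 5:1 odds in the example above, and `inverse_logit` is what turns a fitted b0 + b1x back into Pr(Y = 1).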

35

Interpreting logistic regression

• Take exp(coef) to get interpretable odds.

• For each unit increase in x, odds are multiplied by exp(b1)

– Note that this can make small coefs important!

• Use e.g., Hosmer-Lemeshow test for goodness of fit – null == data fit the model

– But not a lot of power!
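Because the effect is multiplicative on the odds scale, a coefficient that looks small can matter over a wide range of x. A sketch with a hypothetical coefficient of 0.05 per unit (not taken from any real model):

```python
from math import exp

def odds_multiplier(b1, delta_x=1.0):
    """Factor by which the odds change when x increases by delta_x."""
    return exp(b1 * delta_x)

b1 = 0.05  # hypothetical logistic regression coefficient
print(odds_multiplier(b1))      # change in odds per single unit of x
print(odds_multiplier(b1, 20))  # change in odds over 20 units of x
```

Here one unit of x raises the odds by about 5%, but 20 units multiply them by roughly e.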

36

MULTIVARIATE

37

Multiple regression

• Linear/logistic regression with more variables!

– At least one numeric, 0+ categorical

• Still: fixed x, normal errors w/ equal variance, independent errors (linear)

• Linear relationship in E(Y) and one x, when other inputs held constant

– Effects of each x are independent!

• Still check q-n of residuals, residual vs. fit

38

Model selection

• Which covariates to keep? (more on this in a bit)

39

Adding categorical vars

• Indicator variables (everything is 0 or 1)

• Need one fewer indicator than conditions

– One condition is true; or none are true (baseline)

– Coefs are *relative to baseline*!

• Model selection: keep all or none for one factor

• Called “ANCOVA” when at least one each numeric + categorical
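Indicator coding can be sketched as below. The function name, the condition labels, and the choice to order the non-baseline levels alphabetically are assumptions for illustration.

```python
def indicator_code(values, baseline):
    """Return (levels, rows): one 0/1 column per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

levels, rows = indicator_code(["control", "A", "B", "A"], baseline="control")
print(levels)  # the two non-baseline conditions
print(rows)    # baseline rows are all zeros
```

Three conditions yield two indicator columns; the baseline is the row where every indicator is 0, which is why each coefficient reads as a difference from baseline.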

40

Interaction

• What if your covariates *aren’t* independent?

• E(Y) = b0 + b1x1 + b2x2 + b12x1x2

– Slope for x1 is diff. for each value of x2

• Superadditive: all in same direction, interaction makes effects stronger

• Subadditive: interaction is in opposite direction

• For indicator vars, all or none

41

Model selection!

• Which covariates to keep?

• From theory

• Keep interaction only if it’s significant?

– If keep interaction, should keep corresponding mains

• ”Adjusted” R^2?

– Regular R^2 always higher w/ more covars

• BIC and AIC

– Take model likelihood and penalize for more params

– Abs value not interpretable; lower is better

• All combinations? Stepwise?
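The penalized criteria mentioned above have simple standard forms: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of observations, and L the model likelihood. A sketch with illustrative function names; the log-likelihood itself would come from a fitted model.

```python
from math import log

def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; penalizes parameters more as n grows."""
    return n_params * log(n_obs) - 2 * log_likelihood

def adjusted_r2(r2, n_obs, n_predictors):
    """R^2 penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

Only differences between models fitted to the same data are meaningful for AIC and BIC, which matches the note that the absolute value is not interpretable.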

42

THINGS WE ARE ONLY GOING TO MENTION BRIEFLY

Know they exist; look them up if relevant

43

Multi-way ANOVA

• >1 cat IVs, 1 numeric DV

• Normality, equal variance, indep. errors

• With interaction: every combo of factor levels has its own population mean

• Without interaction (additive): change in one var is consistent at all fixed vals of the others

• Works basically like standard ANOVA, etc.

44

Mixed models regression

• Explicitly model correlations in data

• Fixed effects: affect outcome for everyone

• Random effects: deviations per data item, don’t want to model individually

• Simplest example: repeated measures

– Y ~ b0 + b1x1 + b2x2 …. + random ID intercept

– Each participant has their own intercept adjustment

45

POWER ANALYSIS

46

What is power?

• Null distribution: designed so that we’d only see a test statistic this extreme 5% of the time

• This bounds type I but not type II

• Power = 1 – type II error rate

• Heuristic: 80% is “good enough”

47

Alternative scenarios

• One null, but infinitely many alternatives!

• Alternative distribution: given some n, underlying variance, underlying diff. in pop. means, what is the distribution of test statistic

• You know the critical value, so tells you how often your p will be above 0.05 when the “true” scenario is as you model

48

Calculating power

• A priori, to think about sample size and judge value of experiment

• Inherently requires estimating the alternative scenario!

– Maybe try a few

• Statistic-specific, but in general:

– Sample size, effect size, power, alpha

• “Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size”
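To show how sample size, effect size, power, and alpha trade off, here is a rough sketch for the simplest case: a two-sided comparison of two equal-sized groups, using a normal approximation. That approximation and the function name are assumptions made for illustration; exact calculations for t or F tests use noncentral distributions and give slightly lower power.

```python
from statistics import NormalDist

def approx_power_two_sample(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power, with effect_size_d = (difference in means) / sigma."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative scenario.
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability the statistic lands in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (20, 40, 64, 100):
    print(n, round(approx_power_two_sample(0.5, n), 2))
```

Trying a few alternative scenarios this way shows quickly how many participants are needed before power reaches the 80% heuristic.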

49

Example from Seltman book

• F statistic (ANOVA)

• 3 treatments

• 50 people each

• Red: sigma = 10, means: 10, 12, 14

• Blue: sigma = 10, means: 10, 13, 16

50

Promoting power

• (Review from earlier)

• Raise sample size; reduce variance; aim for bigger effects

51

Walkthrough: linear regression

• u = model df -> number of params

• v = F-test error df -> N – u – 1

• f^2 = r^2 / (1 – r^2) … r^2 = f^2 / (1 + f^2)

52

Retrospective power

• Somewhat controversial

• Calculate observed effect size, then determine what sample size would be needed

– Whole new experiment, not just collect more

• Not a good idea:

– We didn’t find a significant effect, but if we had studied 12 more people …
