Statistics for variationists - or - what a linguist needs to know about statistics
Sean Wallis, Survey of English Usage, University College London


Post on 06-Jan-2018


Page 2: Outline
• What is the point of statistics?
– Variationist corpus linguistics
– How inferential statistics works
• Introducing z tests
– Two types (single-sample and two-sample)
– How these tests are related to χ²
• ‘Effect size’ and comparing results of experiments
• Methodological implications for corpus linguistics

Page 4: What is the point of statistics?
• Analyse data you already have
– corpus linguistics
• Design new experiments
– collect new data, add annotation
– experimental linguistics ‘in the lab’
• Try new methods
– pose the right question
• We are going to focus on z and χ² tests
[Slide annotations (braces beside the bullets): experimental science; observational science; philosophy of science; a little maths]

Page 5: What is ‘inferential statistics’?
• Suppose we carry out an experiment
– We toss a coin 10 times and get 5 heads
– How confident are we in the results?
• Suppose we repeat the experiment
• Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment
– We infer from the sample to the population
• Let us consider one type of experiment
– Linguistic alternation experiments
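The question of whether a repeated coin-toss experiment gives the same result can be explored by simulation. This is a minimal sketch and not part of the original slides: it assumes a fair coin (P = 0.5), and the function name `toss_experiment`, the seed and the choice of 1,000 repeats are illustrative.

```python
import random
from collections import Counter

def toss_experiment(n_tosses: int, p_heads: float, rng: random.Random) -> int:
    """Run one experiment: toss a coin n_tosses times and count the heads."""
    return sum(rng.random() < p_heads for _ in range(n_tosses))

# Assumption: a fair coin and 1,000 'ghost' repeats of the 10-toss experiment.
rng = random.Random(1)  # fixed seed so that runs are repeatable
results = [toss_experiment(10, 0.5, rng) for _ in range(1000)]
tally = Counter(results)

for heads in range(11):
    print(f"{heads:2d} heads: {tally[heads]} experiments")
```

Most repeats land near 5 heads, but not all of them do, which is the variation that inferential statistics tries to describe.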

Page 7: Alternation experiments
• A variationist corpus paradigm
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
– add: choose to extend a phrase or clause, or stop
– select: choose between constructions
• Choices will be constrained
– grammatically
– semantically
• Research question:
– within these constraints, what factors influence the particular choice?

Page 8: Alternation experiments
• Laboratory experiment (cued)
– pose the choice to subjects
– observe the one they make
– manipulate different potential influences
• Observational experiment (uncued)
– observe the choices speakers make when they make them (e.g. in a corpus)
– extract data for different potential influences
  • sociolinguistic: subdivide data by genre, etc.
  • lexical/grammatical: subdivide data by elements in surrounding context
– BUT the alternate choice is counterfactual

Page 9: Statistical assumptions
• A random sample taken from the population
– Not always easy to achieve
  • multiple cases from the same text and speakers, etc.
  • may be limited historical data available
– Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
– This is easy to satisfy in linguistics!
• Observations are free to vary (alternate)
• Repeated sampling tends to form a Binomial distribution around the expected mean
– This requires slightly more explanation...

Pages 10-16: The Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean P
• We toss a coin 10 times, and get 5 heads
• Due to chance, some samples will have a higher or lower score
[Figure: histogram of frequency F against number of heads x (axis ticks 1, 3, 5, 7, 9), built up over N = 1, 4, 8, 12, 16, 20 and 24 repeated samples, with the expected mean P marked]
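The shape that repeated sampling tends towards can be computed exactly. The sketch below is an illustration rather than anything from the slides: it assumes a fair coin (P = 0.5) and 10 tosses per experiment, and `binomial_pmf` is a name chosen here.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumption: 10 tosses of a fair coin, as in the running example.
n, P = 10, 0.5
dist = [binomial_pmf(k, n, P) for k in range(n + 1)]

for k, prob in enumerate(dist):
    print(f"{k:2d} heads: {prob:.4f}")
```

Under these assumptions the single most likely outcome, 5 heads, still occurs in only about a quarter of experiments.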

Page 17: Binomial → Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: frequency F against x (axis ticks 1, 3, 5, 7, 9)]
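How close the two distributions are can be checked numerically. This is a rough sketch under the same assumed setup as the coin example (n = 10, P = 0.5); the Normal curve is given the Binomial's mean nP and standard deviation √(nP(1 – P)), and all names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

# Assumption: 10 tosses of a fair coin.
n, P = 10, 0.5
normal = NormalDist(mu=n * P, sigma=sqrt(n * P * (1 - P)))

def binomial_pmf(k: int) -> float:
    """Exact Binomial probability of k heads out of n."""
    return comb(n, k) * P**k * (1 - P) ** (n - k)

# Compare the discrete probability with the Normal density at each k.
rows = [(k, binomial_pmf(k), normal.pdf(k)) for k in range(n + 1)]
largest_gap = max(abs(b - g) for _, b, g in rows)

for k, b, g in rows:
    print(f"x = {k:2d}  Binomial {b:.4f}  Normal {g:.4f}")
print(f"largest gap: {largest_gap:.4f}")
```

The approximation is weaker for small n and for P close to 0 or 1.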

Pages 18-20: The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
– population mean P
– standard deviation $s = \sqrt{P(1 - P)/n}$
– With more data in the experiment, s will be smaller
– Divide x by 10 for probability scale
– 95% of the curve is within ~2 standard deviations of the expected mean
– the correct figure is 1.95996! This is $z_{\alpha/2}$, the critical value of z for an error level of 0.05.
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) centred on the population mean P, with z.s marked either side, 95% of the area in the middle and 2.5% in each tail]
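Both quantities on these slides can be computed with the Python standard library. This is a minimal sketch: the function names are chosen here, and the values P = 0.5 and n = 10 simply continue the coin example.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(alpha: float = 0.05) -> float:
    """Two-tailed critical value of z: alpha/2 of the curve lies in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def binomial_sd(P: float, n: int) -> float:
    """Standard deviation of an observed proportion around the population mean P."""
    return sqrt(P * (1 - P) / n)

z = critical_z(0.05)
print(f"critical z = {z:.5f}")

# With more data in the experiment, s gets smaller.
for n in (10, 100, 1000):
    s = binomial_sd(0.5, n)
    print(f"n = {n:4d}  s = {s:.4f}  P ± z.s = {0.5 - z * s:.3f} to {0.5 + z * s:.3f}")
```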

Page 21: The single-sample z test...
• Is an observation p > z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) centred on P, with z.s marked either side, the two tails marked, and the observation p]
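The test described on this slide can be sketched as follows. The function names and the example figures (an observed p of 0.7 against an expected P of 0.5) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def single_sample_z(p: float, P: float, n: int) -> float:
    """Distance of the observation p from P, in standard deviations of P."""
    s = sqrt(P * (1 - P) / n)
    return (p - P) / s

def is_significant(p: float, P: float, n: int, alpha: float = 0.05) -> bool:
    """True if p lies more than the critical z standard deviations from P."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(single_sample_z(p, P, n)) > z_crit

# The same observed proportion can be non-significant or significant
# depending on how much data supports it.
for n in (10, 100):
    z = single_sample_z(0.7, 0.5, n)
    print(f"n = {n:3d}  z = {z:.3f}  significant: {is_significant(0.7, 0.5, n)}")
```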

Pages 22-23: ...gives us a “confidence interval”
• P ± z.s is the confidence interval for P
– We want to plot the interval about p
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) centred on P, with the observation p and its interval bounds w– and w+]

Page 24: ...gives us a “confidence interval”
• The interval about p is called the Wilson score interval
• This interval is asymmetric
• It reflects the Normal interval about P:
  • If P is at the upper limit of p, p is at the lower limit of P
(Wallis, to appear, a)
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) centred on P, with the observation p and its interval bounds w– and w+]

Page 25: ...gives us a “confidence interval”
• The interval about p is called the Wilson score interval
• To calculate w– and w+ we use this formula:

$$w^-, w^+ = \left( p + \frac{z^2}{2n} \pm z \sqrt{\frac{p(1 - p)}{n} + \frac{z^2}{4n^2}} \right) \Big/ \left( 1 + \frac{z^2}{n} \right)$$

(Wilson, 1927)
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) centred on P, with the observation p and its interval bounds w– and w+]
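The Wilson score interval of Wilson (1927) translates directly into code. This is a minimal sketch without continuity correction; the function name `wilson_interval` and the default z of 1.959964 (an error level of 0.05) are choices made here.

```python
from math import sqrt

def wilson_interval(p: float, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Return (w-, w+), the Wilson score interval about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Illustrative observations, all with n = 10.
for p in (0.5, 0.1, 0.0):
    lower, upper = wilson_interval(p, 10)
    print(f"p = {p:.1f}  w- = {lower:.4f}  w+ = {upper:.4f}")
```

For skewed p the interval is asymmetric and it never extends beyond the range 0 to 1, which makes it suitable for plotting observations at or near p = 0 or 1.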

Page 26: Plotting confidence intervals
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Highly skewed p in some cases
– p = 0 or 1 (circled)
• Confidence intervals identify the degree of certainty in our results
(Wallis, to appear, a)
[Figure: p(shall | {shall, will}) on a scale of 0.0 to 1.0 plotted by year, 1955 to 1995, with confidence intervals]

Page 27: Plotting confidence intervals
• Probability of adding successive attributive adjective phrases (AJPs) to a NP in ICE-GB
– x = number of AJPs
• NPs get longer → adding AJPs is more difficult
• The first two falls are significant, the last is not
[Figure: p on a scale of 0.00 to 0.25 plotted against x = 0 to 4, with confidence intervals]

Pages 28-29: 2 × 1 goodness of fit χ² test
• Same as single-sample z test for P (z² = χ²)
– Does the value of a affect p(b)?
• Or Wilson test for p (by inversion)
• IV: A = {a, ¬a}; DV: B = {b, ¬b}
[Figure: Normal curve of F against p centred on P = p(b), with z.s marked either side, and the observation p(b | a) with its Wilson interval w–, w+]

Page 30: The single-sample z test
• Compares an observation with a given value
– Compare p(b | a) with p(b)
– A “goodness of fit” test
– Identical to a standard 2 × 1 χ² test
• Note that p(b) is given
– All of the variation is assumed to be in the estimate of p(b | a)
– Could also compare p(b | ¬a) with p(b)
[Figure: p(b | a), p(b) and p(b | ¬a) compared]
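The claim that the single-sample z test and the 2 × 1 goodness of fit χ² test are the same test (z² = χ²) can be checked numerically. The counts below are hypothetical and the function names are chosen for this sketch.

```python
from math import sqrt

def goodness_of_fit_chi2(f_b: int, n: int, P: float) -> float:
    """2 x 1 chi-square: observed (b, not-b) counts against expected n*P and n*(1-P)."""
    observed = [f_b, n - f_b]
    expected = [n * P, n * (1 - P)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def single_sample_z(f_b: int, n: int, P: float) -> float:
    """z score of the observed proportion p(b | a) against the given P = p(b)."""
    p = f_b / n
    return (p - P) / sqrt(P * (1 - P) / n)

# Hypothetical data: 30 of the 100 cases under a are b; p(b) is taken as 0.4.
chi2 = goodness_of_fit_chi2(30, 100, 0.4)
z = single_sample_z(30, 100, 0.4)
print(f"chi-square = {chi2:.4f}, z squared = {z * z:.4f}")
```

Here P is treated as given, so all of the variation is attributed to the estimate of p(b | a), as the slide notes.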

Page 31: z test for 2 independent proportions
• Method: combine observed values
– take the difference (subtract) |p1 – p2|
– calculate an ‘averaged’ confidence interval
(Wallis, to appear, b)
[Figure: two distributions O1 and O2 of F against p, about p1 = p(b | a) and p2 = p(b | ¬a)]

Page 32: z test for 2 independent proportions
• New confidence interval D = |O1 – O2|
– standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$
– $\hat{p} = p(b)$
– compare z.s' with x = |p1 – p2|
(Wallis, to appear, b)
[Figure: distribution D of the difference x, centred on mean x = 0, with z.s' marked]
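The comparison of z.s' with |p1 – p2| can be sketched as below. The counts are hypothetical, the function name is chosen here, and the pooled estimate of p(b) is computed from both samples together.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(f1: int, n1: int, f2: int, n2: int, alpha: float = 0.05):
    """Return (x, z.s', significant) for two independent proportions f1/n1 and f2/n2."""
    p1, p2 = f1 / n1, f2 / n2
    p_hat = (f1 + f2) / (n1 + n2)  # pooled estimate of p(b)
    s_prime = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    x = abs(p1 - p2)
    return x, z_crit * s_prime, x > z_crit * s_prime

# Hypothetical counts: b occurs 30 times in 100 cases of a, 50 times in 100 of not-a.
x, width, significant = two_sample_z_test(30, 100, 50, 100)
print(f"|p1 - p2| = {x:.3f}, z.s' = {width:.3f}, significant: {significant}")
```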

Pages 33-35: z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test
– So you can use the usual method!
• BUT: 2 × 1 and 2 × 2 tests have different purposes
– 2 × 1 goodness of fit compares single value a with superset A
  • assumes only a varies
– 2 × 2 test compares two values a, ¬a within a set A
  • both values may vary
• Q: Do we need χ²?
• IV: A = {a, ¬a}
[Figure: set A divided into a and ¬a; the goodness of fit (g.o.f.) χ² relates a to A, the 2 × 2 χ² relates a to ¬a]
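The equivalence with the standard 2 × 2 χ² test can also be confirmed with a small calculation. This sketch uses a hypothetical table of counts (rows a and ¬a, columns b and ¬b) and function names chosen here; it applies no continuity correction.

```python
from math import sqrt

def chi2_2x2(table: list[list[int]]) -> float:
    """Standard 2 x 2 chi-square: sum of (O - E)^2 / E over all four cells."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    N = sum(rows)
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / N
            total += (table[i][j] - expected) ** 2 / expected
    return total

def z_from_table(table: list[list[int]]) -> float:
    """Two independent proportions z score, using the pooled estimate of p(b)."""
    (f1, g1), (f2, g2) = table
    n1, n2 = f1 + g1, f2 + g2
    p_hat = (f1 + f2) / (n1 + n2)
    return (f1 / n1 - f2 / n2) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

# Hypothetical counts: rows are a and not-a, columns are b and not-b.
table = [[30, 70], [50, 50]]
chi2 = chi2_2x2(table)
z = z_from_table(table)
print(f"chi-square = {chi2:.4f}, z squared = {z * z:.4f}")
```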

Page 36: Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
– r × 1 goodness of fit χ² tests (r ≥ 2)
– r × c χ² tests for homogeneity (r, c ≥ 2)
• z tests have 1 degree of freedom
– strength: significance is due to only one source
– strength: easy to plot values and confidence intervals
– weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
– Examine χ² contributions for each row or column
– Focus on alternation - try to test for a speaker choice

Page 37: How big is the effect?
• These tests do not measure the strength of the interaction between two variables
– They test whether the strength of an interaction is greater than would be expected by chance
  • With lots of data, a tiny change would be significant
• Don’t use χ², p or z values to compare two different experiments
– A result significant at p<0.01 is not ‘better’ than one significant at p<0.05
• There are a number of ways of measuring ‘association strength’ or ‘effect size’

Pages 38-39: How big is the effect?
• Percentage swing
– swing d = p(a | ¬b) – p(a | b)
– % swing d% = d/p(a | b)
– frequently used (“X increased by 50%”)
  • may have confidence intervals on change
  • can be misleading (“+50%” then “-50%” is not zero)
– one change, not sequence
– over one value, not multiple values
• Cramér’s φ
– $\phi = \sqrt{\chi^2 / N}$ (2 × 2), N = grand total
– $\phi_c = \sqrt{\chi^2 / ((k - 1)N)}$ (r × c), k = min(r, c)
  • measures degree of association of one variable with another (across all values)
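Both effect size measures can be sketched in a few lines. The function names, the example proportions and the table of counts are hypothetical; `cramers_phi` uses the r × c form, which reduces to √(χ²/N) for a 2 × 2 table.

```python
from math import sqrt

def swing(p_base: float, p_new: float) -> float:
    """Absolute swing d between a baseline proportion and a new one."""
    return p_new - p_base

def percentage_swing(p_base: float, p_new: float) -> float:
    """Swing expressed as a fraction of the baseline proportion."""
    return swing(p_base, p_new) / p_base

def cramers_phi(table: list[list[int]]) -> float:
    """Cramér's phi for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    N = sum(rows)
    chi2 = sum(
        (table[i][j] - rows[i] * cols[j] / N) ** 2 / (rows[i] * cols[j] / N)
        for i in range(len(rows))
        for j in range(len(cols))
    )
    k = min(len(rows), len(cols))
    return sqrt(chi2 / ((k - 1) * N))

# "+50%" followed by "-50%" does not return to the starting point.
start = 0.2
after_rise = start * 1.5        # +50%
after_fall = after_rise * 0.5   # -50% of the new value
print(f"{start} -> {after_rise:.2f} -> {after_fall:.2f}")
print(f"percentage swing 0.2 -> 0.3: {percentage_swing(0.2, 0.3):+.0%}")
print(f"phi = {cramers_phi([[30, 70], [50, 50]]):.4f}")
```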

Pages 40-42: Comparing experimental results
• Suppose we have two similar experiments
– How do we test if one result is significantly stronger than another?
• Test swings
– z test for two samples from different populations
– Use $s' = \sqrt{s_1^2 + s_2^2}$
– Test |d1(a) – d2(a)| > z.s'
• Same method can be used to compare other z or χ² tests
(Wallis 2011)
[Figure: swings d1(a) and d2(a) plotted on a scale of -0.7 to 0, with confidence intervals]
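The test on swings can be sketched as follows. The slides do not spell out how s1 and s2 are obtained, so this example simply takes the standard deviation of each swing as an input; the function name and all figures are hypothetical.

```python
from math import sqrt

def swings_differ(d1: float, s1: float, d2: float, s2: float,
                  z: float = 1.959964) -> bool:
    """Test |d1 - d2| > z.s', where s' combines the two standard deviations."""
    s_combined = sqrt(s1**2 + s2**2)
    return abs(d1 - d2) > z * s_combined

# Assumption: each experiment supplies a swing and its standard deviation.
print(swings_differ(-0.5, 0.05, -0.2, 0.05))  # large gap, small variation
print(swings_differ(-0.5, 0.10, -0.4, 0.10))  # small gap, larger variation
```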

Page 43: Modern improvements on z and χ²
• ‘Continuity correction’ for small n
– Yates’ χ² test
– errs on side of caution
– can also be applied to Wilson interval
• Newcombe (1998) improves on 2 × 2 χ² test
– combines two Wilson score intervals
– performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples
• However, for corpus linguists, there remains one outstanding problem...
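One way of combining two Wilson score intervals into an interval for the difference p1 – p2 is sketched below. It follows what is usually described as Newcombe's Wilson-based method without continuity correction; the function names and example proportions are assumptions, and the details should be checked against Newcombe (1998) before use.

```python
from math import sqrt

def wilson(p: float, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson(p1: float, n1: int, p2: float, n2: int,
                    z: float = 1.959964) -> tuple[float, float]:
    """Interval for the difference d = p1 - p2 built from two Wilson intervals."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# Hypothetical proportions: 0.3 and 0.5, each observed over 100 cases.
lower, upper = newcombe_wilson(0.3, 100, 0.5, 100)
print(f"difference interval: {lower:.4f} to {upper:.4f}")
print("significant:", not (lower <= 0 <= upper))
```

The difference is treated as significant when the interval excludes zero.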

Pages 44-46: Experimental design
• Each observation should be free to vary
– i.e. p can be any value from 0 to 1
• However many people use these methods incorrectly
– e.g. citation ‘per million words’
  • what does this actually mean?
• Baseline should be choice
– Experimentalists can design choice into experiment
– Corpus linguists have to infer when speakers had opportunity to choose, counterfactually
[Figure: b1 and b2 measured against three baselines: p(b | words), p(b | VPs) and p(b | tensed VPs)]
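The effect of the baseline can be illustrated with invented numbers. In this sketch two hypothetical subcorpora have the same number of words, but speakers in the second had fewer opportunities to choose between shall and will; none of the counts come from a real corpus.

```python
def per_million_words(freq: int, words: int) -> float:
    """Frequency normalised to a rate per million words."""
    return freq / words * 1_000_000

def choice_proportion(freq: int, alternatives: int) -> float:
    """Proportion of the item out of all cases where the choice was available."""
    return freq / (freq + alternatives)

# Invented counts for illustration only.
subcorpora = {
    "A": {"words": 1_000_000, "shall": 100, "will": 400},
    "B": {"words": 1_000_000, "shall": 50, "will": 200},
}

for name, c in subcorpora.items():
    pmw = per_million_words(c["shall"], c["words"])
    p = choice_proportion(c["shall"], c["will"])
    print(f"{name}: shall pmw = {pmw:.0f}, p(shall | {{shall, will}}) = {p:.2f}")
```

In this invented case the per-million-word rate halves while the proportion of shall among the alternants is unchanged, so the two baselines would support different conclusions.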

Pages 47-50: A methodological progression
• Aim:
– investigate change when speakers have a choice
• Four levels of experimental refinement:
– pmw (baseline: words)
– select a plausible baseline (baseline: tensed VPs)
– grammatically restrict data or enumerate cases (baseline: {will, shall})
– check each case individually for plausibility of alternation (baseline: {will, shall}; example: “Ye shall be saved”)

Page 51: Conclusions
• The basic idea of these methods is
– Predict future results if experiment were repeated
  • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
– Approximated by Normal distribution – many uses
  • Plotting confidence intervals
  • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
  • Use 2 × 2 tests or two independent sample z tests to compare two observed samples
  • When using larger r × c tests, simplify as far as possible to identify the source of variation!
• Take care with small samples / low frequencies
– Use Wilson and Newcombe’s methods instead!

Page 52: Conclusions
• Two methods for measuring the ‘size’ of an experimental effect
– absolute or percentage swing
– Cramér’s φ
• You can compare two experiments
• These methods all presume that
– observed p is free to vary (speaker is free to choose)
• If this is not the case then
– statistical model is undermined
  • confidence intervals are too conservative
– but multiple changes are combined into one
  • e.g. VPs increase while modals decrease
  • so significant change may not mean what you think!

Page 53: References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL.
• Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics.
• Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: My statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com