Statistics for variationists - or - what a linguist needs to know about statistics. Sean Wallis...
DESCRIPTION
What is the point of statistics? Analyse data you already have (corpus linguistics). Design new experiments: collect new data, add annotation (experimental linguistics ‘in the lab’). Try new methods: pose the right question. We are going to focus on z and χ² tests.

TRANSCRIPT
Statistics for variationists
- or -
what a linguist needs to know about statistics

Sean Wallis
Survey of English Usage
University College London
[email protected]
Outline
• What is the point of statistics?
  – Variationist corpus linguistics
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• ‘Effect size’ and comparing results of experiments
• Methodological implications for corpus linguistics
What is the point of statistics?
• Analyse data you already have
  – corpus linguistics (observational science)
• Design new experiments
  – collect new data, add annotation
  – experimental linguistics ‘in the lab’ (experimental science)
• Try new methods
  – pose the right question (philosophy of science)
• We are going to focus on z and χ² tests (a little maths)
What is ‘inferential statistics’?
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
  – Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment
  – We infer from the sample to the population
• Let us consider one type of experiment
  – Linguistic alternation experiments
Alternation experiments
• A variationist corpus paradigm
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question:
  – within these constraints, what factors influence the particular choice?
Alternation experiments
• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc.
    • lexical/grammatical: subdivide data by elements in surrounding context
  – BUT the alternate choice is counterfactual
Statistical assumptions
• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same texts and speakers, etc.
    • may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Observations are free to vary (alternate)
• Repeated sampling tends to form a Binomial distribution around the expected mean
  – This requires slightly more explanation...
The Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean P
• We toss a coin 10 times, and get 5 heads
• Due to chance, some samples will have a higher or lower score
[Animated figure: frequency F of each score x (1, 3, 5, 7, 9 heads) across repeated samples, for N = 1, 4, 8, 12, 16, 20, 24 repetitions, building up a Binomial distribution around the expected mean P]
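The coin-toss example above can be checked directly. A minimal Python sketch (standard library only; the function name `binom_pmf` is mine) computes the exact Binomial probabilities for ten tosses of a fair coin:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses, per-toss probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Ten tosses of a fair coin: the distribution of the number of heads
# peaks at the expected mean n*p = 5, but chance scatters samples around it.
dist = [binom_pmf(k, 10, 0.5) for k in range(11)]
```

Here `dist[5]` is 252/1024 ≈ 0.246: even at the expected mean, only about a quarter of repeated experiments land on exactly 5 heads, which is why a single sample can easily fall above or below P.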
Binomial → Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: Binomial frequency distribution F over x = 1, 3, 5, 7, 9 overlaid with the continuous Normal curve]
The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean P
  – standard deviation s = √( P(1 – P) / n )
  – With more data in the experiment, s will be smaller
  – Divide x by 10 for the probability scale p (0.1, 0.3, 0.5, 0.7)
• 95% of the curve is within ~2 standard deviations of the expected mean, with 2.5% in each tail
  – the correct figure is 1.95996!
  – = zα/2, the critical value of z for an error level (α) of 0.05
[Figure: Normal curve F over p, centred on the population mean P, with the central 95% spanning P ± z·s and 2.5% in each tail]
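The critical value quoted above (1.95996) and the standard deviation formula can both be reproduced with Python’s standard library (`binomial_sd` is my name for the helper):

```python
from statistics import NormalDist

# Two-tailed critical value of z for error level alpha = 0.05:
# 2.5% in each tail, so invert the Normal CDF at 0.975.
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.95996

def binomial_sd(P, n):
    """Standard deviation s = sqrt(P(1 - P)/n) of the sampling
    distribution for a proportion with population mean P, sample size n."""
    return (P * (1 - P) / n) ** 0.5
```

As the slide notes, more data means a smaller s: quadrupling n halves the standard deviation.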
The single-sample z test...
• Is an observation p more than z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P
[Figure: Normal curve about the population mean P with small tails beyond P ± z·s; the observation p falls in a tail]
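As a sketch, the test above is only a few lines of Python (`single_sample_z` is a hypothetical helper name; standard library only):

```python
from statistics import NormalDist

def single_sample_z(p, P, n, alpha=0.05):
    """Is observed proportion p more than z standard deviations from
    the expected population mean P, given n observations?"""
    s = (P * (1 - P) / n) ** 0.5                  # sd under the null hypothesis
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return abs(p - P) / s > z_crit
```

With 5 heads in 10 tosses, `single_sample_z(0.5, 0.5, 10)` is False: the observation is consistent with a fair coin.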
...gives us a “confidence interval”
• P ± z·s is the confidence interval for P
  – We want to plot the interval about p
• The interval about p is called the Wilson score interval
• This interval is asymmetric
• It reflects the Normal interval about P:
  – If P is at the upper limit of p, p is at the lower limit of P (Wallis, to appear, a)
• To calculate w– and w+ we use this formula (Wilson, 1927):

  w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n )

[Figure: Normal curve about P with 2.5% tails; observation p shown with its asymmetric interval (w–, w+)]
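The formula above can be sketched in Python (standard library only; `wilson_interval` is my name):

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson (1927) score interval (w-, w+) for an observed
    proportion p out of n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom
```

For 5 heads out of 10 this gives roughly (0.237, 0.763); for p = 0 the lower bound stays at 0 while the upper bound does not, which is how the interval copes with highly skewed observations.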
Plotting confidence intervals
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Highly skewed p in some cases
  – p = 0 or 1 (circled)
• Confidence intervals identify the degree of certainty in our results (Wallis, to appear, a)
[Figure: p(shall | {shall, will}) by year, 1955–1995, on a 0.0–1.0 scale, with Wilson confidence intervals]
Plotting confidence intervals
• Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB
  – x = number of AJPs
• NPs get longer → adding AJPs is more difficult
• The first two falls are significant, the last is not
[Figure: p plotted against x = 0 to 4, falling from about 0.25 towards 0.00, with confidence intervals]
2 × 1 goodness of fit χ² test
• Same as single-sample z test for P (z² = χ²)
  – Does the value of a affect p(b)?
• Or Wilson test for p (by inversion)
• IV: A = {a, ¬a}; DV: B = {b, ¬b}
[Figure: Normal curve about P = p(b), with the observed p(b | a) compared against it; equivalently, the Wilson interval (w–, w+) about p(b | a) compared against P]
The single-sample z test
• Compares an observation with a given value
  – Compare p(b | a) with p(b)
  – A “goodness of fit” test
  – Identical to a standard 2 × 1 χ² test
• Note that p(b) is given
  – All of the variation is assumed to be in the estimate of p(b | a)
  – Could also compare p(b | ¬a) with p(b)
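The claimed identity z² = χ² can be verified numerically. A minimal sketch (function names `gof_chisq` and `z_score` are mine):

```python
def gof_chisq(o1, o2, P):
    """2 x 1 goodness of fit chi-square for observed cell counts
    [o1, o2] against an expected proportion P for the first cell."""
    n = o1 + o2
    return sum((o - e) ** 2 / e
               for o, e in zip([o1, o2], [n * P, n * (1 - P)]))

def z_score(p, P, n):
    """Single-sample z score for observed proportion p against mean P."""
    return (p - P) / ((P * (1 - P) / n) ** 0.5)

# Example: 30 cases of b out of 100, against a given p(b) = 0.25.
chi = gof_chisq(30, 70, 0.25)
z = z_score(0.3, 0.25, 100)
# chi equals z squared, so the two tests always agree.
```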
z test for 2 independent proportions
• Method: combine observed values
  – take the difference (subtract) |p1 – p2|
  – calculate an ‘averaged’ confidence interval
• New confidence interval D = |O1 – O2|
  – standard deviation s' = √( p̂(1 – p̂)(1/n1 + 1/n2) )
  – p̂ = p(b)
  – compare z·s' with x = |p1 – p2|
(Wallis, to appear, b)
[Figure: observed distributions O1 about p1 = p(b | a) and O2 about p2 = p(b | ¬a); their difference D about mean x = 0, with critical value z·s']
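A sketch of the two-proportion test described above (`two_proportion_z` is a hypothetical name; it uses the pooled probability as p̂):

```python
from statistics import NormalDist

def two_proportion_z(f1, n1, f2, n2, alpha=0.05):
    """z test for two independent proportions p1 = f1/n1, p2 = f2/n2,
    using the pooled probability p_hat and
    s' = sqrt(p_hat(1 - p_hat)(1/n1 + 1/n2))."""
    p1, p2 = f1 / n1, f2 / n2
    p_hat = (f1 + f2) / (n1 + n2)        # pooled probability p(b)
    s = (p_hat * (1 - p_hat) * (1 / n1 + 1 / n2)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(p1 - p2) > z_crit * s
```

So 30/100 vs 50/100 is a significant difference, while 45/100 vs 50/100 is not.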
z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test
  – So you can use the usual method!
• BUT: 2 × 1 and 2 × 2 tests have different purposes
  – 2 × 1 goodness of fit compares single value a with superset A
    • assumes only a varies
  – 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?
[Diagram: IV: A = {a, ¬a}; the 2 × 1 goodness of fit test compares a with its superset A, while the 2 × 2 test compares a with ¬a]
Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
  – r × 1 goodness of fit χ² tests (r > 2)
  – r × c χ² tests for homogeneity (r, c > 2)
• z tests have 1 degree of freedom
  – strength: significance is due to only one source
  – strength: easy to plot values and confidence intervals
  – weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  – Examine χ² contributions for each row or column
  – Focus on alternation – try to test for a speaker choice
How big is the effect?
• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
  – With lots of data, a tiny change would be significant
• Don’t use χ², p or z values to compare two different experiments
  – A result significant at p < 0.01 is not ‘better’ than one significant at p < 0.05
• There are a number of ways of measuring ‘association strength’ or ‘effect size’
How big is the effect?
• Percentage swing
  – swing d = p(a | ¬b) – p(a | b)
  – % swing d% = d / p(a | b)
  – frequently used (“X increased by 50%”)
    • may have confidence intervals on change
    • can be misleading (“+50%” then “–50%” is not zero)
  – one change, not a sequence
  – over one value, not multiple values
• Cramér’s φ = √( χ²/N ) (2 × 2 tests), N = grand total
  – φc = √( χ²/((k – 1)N) ) (r × c tests), k = min(r, c)
  – measures degree of association of one variable with another (across all values)
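Cramér’s φ can be sketched in a few lines (`cramers_phi` is my name; the χ² computation inside is the standard homogeneity calculation from row and column totals, not taken from the slides):

```python
from math import sqrt

def cramers_phi(table):
    """Cramer's phi for an r x c table of observed counts:
    sqrt(chi2 / ((k - 1) * N)) with k = min(r, c); for a 2 x 2 table
    this reduces to sqrt(chi2 / N)."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    N = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / N) ** 2
               / (rows[i] * cols[j] / N)
               for i in range(len(rows)) for j in range(len(cols)))
    return sqrt(chi2 / ((min(len(rows), len(cols)) - 1) * N))
```

φ runs from 0 (no association between the variables) to 1 (perfect association), regardless of sample size.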
Comparing experimental results
• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test swings
  – z test for two samples from different populations
  – Use s' = √( s1² + s2² )
  – Test |d1(a) – d2(a)| > z·s'
• The same method can be used to compare other z or χ² tests
[Figure: bar chart comparing two negative swings d1(a) and d2(a), on a scale from 0 down to –0.7, with confidence intervals] (Wallis 2011)
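The swing comparison above can be sketched as follows (`compare_swings` is a hypothetical name; d1, d2 are the two swings and s1, s2 their standard deviations):

```python
from statistics import NormalDist

def compare_swings(d1, s1, d2, s2, alpha=0.05):
    """Compare swings d1 and d2 from two experiments, with standard
    deviations s1 and s2, as two samples from different populations:
    significant if |d1 - d2| > z * sqrt(s1**2 + s2**2)."""
    s = (s1 ** 2 + s2 ** 2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(d1 - d2) > z_crit * s
```

Because the combined standard deviation adds both sources of variation, two swings must differ substantially before one result counts as significantly stronger than the other.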
Modern improvements on z and χ²
• ‘Continuity correction’ for small n
  – Yates’ χ² test – errs on the side of caution
  – can also be applied to the Wilson interval
• Newcombe (1998) improves on the 2 × 2 χ² test
  – combines two Wilson score intervals
  – performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples
• However, for corpus linguists, there remains one outstanding problem...
Experimental design
• Each observation should be free to vary
  – i.e. p can be any value from 0 to 1
• However many people use these methods incorrectly
  – e.g. citation ‘per million words’
    • what does this actually mean?
• Baseline should be choice
  – Experimentalists can design choice into the experiment
  – Corpus linguists have to infer when speakers had the opportunity to choose, counterfactually
[Diagram: narrowing baselines p(b | words) → p(b | VPs) → p(b | tensed VPs), with alternates b1, b2]
A methodological progression
• Aim:
  – investigate change when speakers have a choice
• Four levels of experimental refinement:
  – pmw: baseline = all words
  – select a plausible baseline: e.g. tensed VPs
  – grammatically restrict data or enumerate cases: {will, shall}
  – check each case individually for plausibility of alternation: {will, shall} (e.g. “Ye shall be saved”)
Conclusions
• The basic idea of these methods is
  – Predict future results if the experiment were repeated
    • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by the Normal distribution – many uses
    • Plotting confidence intervals
    • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
    • Use 2 × 2 tests or two independent sample z tests to compare two observed samples
• When using larger r × c tests, simplify as far as possible to identify the source of variation!
• Take care with small samples / low frequencies
  – Use Wilson and Newcombe’s methods instead!
Conclusions
• Two methods for measuring the ‘size’ of an experimental effect
  – absolute or percentage swing
  – Cramér’s φ
• You can compare two experiments
• These methods all presume that
  – observed p is free to vary (speaker is free to choose)
• If this is not the case then
  – the statistical model is undermined
    • confidence intervals are too conservative
  – but multiple changes are combined into one
    • e.g. VPs increase while modals decrease
    • so significant change may not mean what you think!
References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL.
• Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics.
• Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: My statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com