Statistics for variationists - or - what a linguist needs to know about statistics. Sean Wallis...
DESCRIPTION
What is the point of statistics? Analyse data you already have (corpus linguistics). Design new experiments: collect new data, add annotation (experimental linguistics ‘in the lab’). Try new methods: pose the right question. We are going to focus on z and χ² tests.

TRANSCRIPT
Statistics for variationists
- or -
what a linguist needs to know about statistics

Sean Wallis
Survey of English Usage
University College London
[email protected]
Outline
• What is the point of statistics?
  – Variationist corpus linguistics
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• ‘Effect size’ and comparing results of experiments
• Methodological implications for corpus linguistics
What is the point of statistics?
• Analyse data you already have
  – corpus linguistics (observational science)
• Design new experiments
  – collect new data, add annotation
  – experimental linguistics ‘in the lab’ (experimental science)
• Try new methods
  – pose the right question (philosophy of science)
• We are going to focus on z and χ² tests (a little maths)
What is ‘inferential statistics’?
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
  – Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment
  – We infer from the sample to the population
• Let us consider one type of experiment
  – Linguistic alternation experiments
Alternation experiments
• A variationist corpus paradigm
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question:
  – within these constraints, what factors influence the particular choice?
Alternation experiments
• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc.
    • lexical/grammatical: subdivide data by elements in surrounding context
  – BUT the alternate choice is counterfactual
Statistical assumptions
• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same texts and speakers, etc.
    • may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Observations are free to vary (alternate)
• Repeated sampling tends to form a Binomial distribution around the expected mean
  – This requires slightly more explanation...
The Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean P
• We toss a coin 10 times, and get 5 heads
• Due to chance, some samples will have a higher or lower score
[Animated figure: frequency F of each score x (1, 3, 5, 7, 9 heads) across repeated samples, for N = 1, 4, 8, 12, 16, 20, 24 repetitions, building up a Binomial distribution around the expected mean P]
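The coin-toss example above can be checked directly. A minimal Python sketch (standard library only; the function name `binom_pmf` is mine) computes the exact Binomial probabilities for ten tosses of a fair coin:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses, per-toss probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Ten tosses of a fair coin: the distribution of the number of heads
# peaks at the expected mean n*p = 5, but chance scatters samples around it.
dist = [binom_pmf(k, 10, 0.5) for k in range(11)]
```

Here `dist[5]` is 252/1024 ≈ 0.246: even at the expected mean, only about a quarter of repeated experiments land on exactly 5 heads, which is why a single sample can easily fall above or below P.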
Binomial → Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: Binomial frequency distribution F over x = 1, 3, 5, 7, 9 overlaid with the continuous Normal curve]
The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean P
  – standard deviation s = √( P(1 – P) / n )
  – With more data in the experiment, s will be smaller
  – Divide x by 10 for the probability scale p (0.1, 0.3, 0.5, 0.7)
• 95% of the curve is within ~2 standard deviations of the expected mean, with 2.5% in each tail
  – the correct figure is 1.95996!
  – = zα/2, the critical value of z for an error level (α) of 0.05
[Figure: Normal curve F over p, centred on the population mean P, with the central 95% spanning P ± z·s and 2.5% in each tail]
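The critical value quoted above (1.95996) and the standard deviation formula can both be reproduced with Python’s standard library (`binomial_sd` is my name for the helper):

```python
from statistics import NormalDist

# Two-tailed critical value of z for error level alpha = 0.05:
# 2.5% in each tail, so invert the Normal CDF at 0.975.
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.95996

def binomial_sd(P, n):
    """Standard deviation s = sqrt(P(1 - P)/n) of the sampling
    distribution for a proportion with population mean P, sample size n."""
    return (P * (1 - P) / n) ** 0.5
```

As the slide notes, more data means a smaller s: quadrupling n halves the standard deviation.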
The single-sample z test...
• Is an observation p more than z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P
[Figure: Normal curve about the population mean P with small tails beyond P ± z·s; the observation p falls in a tail]
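As a sketch, the test above is only a few lines of Python (`single_sample_z` is a hypothetical helper name; standard library only):

```python
from statistics import NormalDist

def single_sample_z(p, P, n, alpha=0.05):
    """Is observed proportion p more than z standard deviations from
    the expected population mean P, given n observations?"""
    s = (P * (1 - P) / n) ** 0.5                  # sd under the null hypothesis
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return abs(p - P) / s > z_crit
```

With 5 heads in 10 tosses, `single_sample_z(0.5, 0.5, 10)` is False: the observation is consistent with a fair coin.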
...gives us a “confidence interval”
• P ± z·s is the confidence interval for P
  – We want to plot the interval about p
• The interval about p is called the Wilson score interval
• This interval is asymmetric
• It reflects the Normal interval about P:
  – If P is at the upper limit of p, p is at the lower limit of P (Wallis, to appear, a)
• To calculate w– and w+ we use this formula (Wilson, 1927):

  w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n )

[Figure: Normal curve about P with 2.5% tails; observation p shown with its asymmetric interval (w–, w+)]
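The formula above can be sketched in Python (standard library only; `wilson_interval` is my name):

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson (1927) score interval (w-, w+) for an observed
    proportion p out of n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom
```

For 5 heads out of 10 this gives roughly (0.237, 0.763); for p = 0 the lower bound stays at 0 while the upper bound does not, which is how the interval copes with highly skewed observations.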
Plotting confidence intervals
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Highly skewed p in some cases
  – p = 0 or 1 (circled)
• Confidence intervals identify the degree of certainty in our results (Wallis, to appear, a)
[Figure: p(shall | {shall, will}) by year, 1955–1995, on a 0.0–1.0 scale, with Wilson confidence intervals]
Plotting confidence intervals
• Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB
  – x = number of AJPs
• NPs get longer → adding AJPs is more difficult
• The first two falls are significant, the last is not
[Figure: p plotted against x = 0 to 4, falling from about 0.25 towards 0.00, with confidence intervals]
2 × 1 goodness of fit χ² test
• Same as single-sample z test for P (z² = χ²)
  – Does the value of a affect p(b)?
• Or Wilson test for p (by inversion)
• IV: A = {a, ¬a}; DV: B = {b, ¬b}
[Figure: Normal curve about P = p(b), with the observed p(b | a) compared against it; equivalently, the Wilson interval (w–, w+) about p(b | a) compared against P]
The single-sample z test
• Compares an observation with a given value
  – Compare p(b | a) with p(b)
  – A “goodness of fit” test
  – Identical to a standard 2 × 1 χ² test
• Note that p(b) is given
  – All of the variation is assumed to be in the estimate of p(b | a)
  – Could also compare p(b | ¬a) with p(b)
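The claimed identity z² = χ² can be verified numerically. A minimal sketch (function names `gof_chisq` and `z_score` are mine):

```python
def gof_chisq(o1, o2, P):
    """2 x 1 goodness of fit chi-square for observed cell counts
    [o1, o2] against an expected proportion P for the first cell."""
    n = o1 + o2
    return sum((o - e) ** 2 / e
               for o, e in zip([o1, o2], [n * P, n * (1 - P)]))

def z_score(p, P, n):
    """Single-sample z score for observed proportion p against mean P."""
    return (p - P) / ((P * (1 - P) / n) ** 0.5)

# Example: 30 cases of b out of 100, against a given p(b) = 0.25.
chi = gof_chisq(30, 70, 0.25)
z = z_score(0.3, 0.25, 100)
# chi equals z squared, so the two tests always agree.
```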
z test for 2 independent proportions
• Method: combine observed values
  – take the difference (subtract) |p1 – p2|
  – calculate an ‘averaged’ confidence interval
• New confidence interval D = |O1 – O2|
  – standard deviation s' = √( p̂(1 – p̂)(1/n1 + 1/n2) )
  – p̂ = p(b)
  – compare z·s' with x = |p1 – p2|
(Wallis, to appear, b)
[Figure: observed distributions O1 about p1 = p(b | a) and O2 about p2 = p(b | ¬a); their difference D about mean x = 0, with critical value z·s']
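A sketch of the two-proportion test described above (`two_proportion_z` is a hypothetical name; it uses the pooled probability as p̂):

```python
from statistics import NormalDist

def two_proportion_z(f1, n1, f2, n2, alpha=0.05):
    """z test for two independent proportions p1 = f1/n1, p2 = f2/n2,
    using the pooled probability p_hat and
    s' = sqrt(p_hat(1 - p_hat)(1/n1 + 1/n2))."""
    p1, p2 = f1 / n1, f2 / n2
    p_hat = (f1 + f2) / (n1 + n2)        # pooled probability p(b)
    s = (p_hat * (1 - p_hat) * (1 / n1 + 1 / n2)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(p1 - p2) > z_crit * s
```

So 30/100 vs 50/100 is a significant difference, while 45/100 vs 50/100 is not.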
z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test
  – So you can use the usual method!
• BUT: 2 × 1 and 2 × 2 tests have different purposes
  – 2 × 1 goodness of fit compares single value a with superset A
    • assumes only a varies
  – 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?
[Diagram: IV: A = {a, ¬a}; the 2 × 1 goodness of fit test compares a with its superset A, while the 2 × 2 test compares a with ¬a]
Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
  – r × 1 goodness of fit χ² tests (r > 2)
  – r × c χ² tests for homogeneity (r, c > 2)
• z tests have 1 degree of freedom
  – strength: significance is due to only one source
  – strength: easy to plot values and confidence intervals
  – weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  – Examine χ² contributions for each row or column
  – Focus on alternation – try to test for a speaker choice
How big is the effect?
• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
  – With lots of data, a tiny change would be significant
• Don’t use χ², p or z values to compare two different experiments
  – A result significant at p < 0.01 is not ‘better’ than one significant at p < 0.05
• There are a number of ways of measuring ‘association strength’ or ‘effect size’
How big is the effect?
• Percentage swing
  – swing d = p(a | ¬b) – p(a | b)
  – % swing d% = d / p(a | b)
  – frequently used (“X increased by 50%”)
    • may have confidence intervals on change
    • can be misleading (“+50%” then “–50%” is not zero)
  – one change, not a sequence
  – over one value, not multiple values
• Cramér’s φ = √( χ²/N ) (2 × 2 tests), N = grand total
  – φc = √( χ²/((k – 1)N) ) (r × c tests), k = min(r, c)
  – measures degree of association of one variable with another (across all values)
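Cramér’s φ can be sketched in a few lines (`cramers_phi` is my name; the χ² computation inside is the standard homogeneity calculation from row and column totals, not taken from the slides):

```python
from math import sqrt

def cramers_phi(table):
    """Cramer's phi for an r x c table of observed counts:
    sqrt(chi2 / ((k - 1) * N)) with k = min(r, c); for a 2 x 2 table
    this reduces to sqrt(chi2 / N)."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    N = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / N) ** 2
               / (rows[i] * cols[j] / N)
               for i in range(len(rows)) for j in range(len(cols)))
    return sqrt(chi2 / ((min(len(rows), len(cols)) - 1) * N))
```

φ runs from 0 (no association between the variables) to 1 (perfect association), regardless of sample size.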
Comparing experimental results
• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test swings
  – z test for two samples from different populations
  – Use s' = √( s1² + s2² )
  – Test |d1(a) – d2(a)| > z·s'
• The same method can be used to compare other z or χ² tests
[Figure: bar chart comparing two negative swings d1(a) and d2(a), on a scale from 0 down to –0.7, with confidence intervals] (Wallis 2011)
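The swing comparison above can be sketched as follows (`compare_swings` is a hypothetical name; d1, d2 are the two swings and s1, s2 their standard deviations):

```python
from statistics import NormalDist

def compare_swings(d1, s1, d2, s2, alpha=0.05):
    """Compare swings d1 and d2 from two experiments, with standard
    deviations s1 and s2, as two samples from different populations:
    significant if |d1 - d2| > z * sqrt(s1**2 + s2**2)."""
    s = (s1 ** 2 + s2 ** 2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(d1 - d2) > z_crit * s
```

Because the combined standard deviation adds both sources of variation, two swings must differ substantially before one result counts as significantly stronger than the other.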
Modern improvements on z and χ²
• ‘Continuity correction’ for small n
  – Yates’ χ² test – errs on the side of caution
  – can also be applied to the Wilson interval
• Newcombe (1998) improves on the 2 × 2 χ² test
  – combines two Wilson score intervals
  – performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples
• However, for corpus linguists, there remains one outstanding problem...
Experimental design
• Each observation should be free to vary
  – i.e. p can be any value from 0 to 1
• However many people use these methods incorrectly
  – e.g. citation ‘per million words’
    • what does this actually mean?
• Baseline should be choice
  – Experimentalists can design choice into the experiment
  – Corpus linguists have to infer when speakers had the opportunity to choose, counterfactually
[Diagram: narrowing baselines p(b | words) → p(b | VPs) → p(b | tensed VPs), with alternates b1, b2]
A methodological progression
• Aim:
  – investigate change when speakers have a choice
• Four levels of experimental refinement:
  – pmw: baseline = all words
  – select a plausible baseline: e.g. tensed VPs
  – grammatically restrict data or enumerate cases: {will, shall}
  – check each case individually for plausibility of alternation: {will, shall} (e.g. “Ye shall be saved”)
Conclusions
• The basic idea of these methods is
  – Predict future results if the experiment were repeated
    • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by the Normal distribution – many uses
    • Plotting confidence intervals
    • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
    • Use 2 × 2 tests or two independent sample z tests to compare two observed samples
• When using larger r × c tests, simplify as far as possible to identify the source of variation!
• Take care with small samples / low frequencies
  – Use Wilson and Newcombe’s methods instead!
Conclusions
• Two methods for measuring the ‘size’ of an experimental effect
  – absolute or percentage swing
  – Cramér’s φ
• You can compare two experiments
• These methods all presume that
  – observed p is free to vary (speaker is free to choose)
• If this is not the case then
  – the statistical model is undermined
    • confidence intervals are too conservative
  – but multiple changes are combined into one
    • e.g. VPs increase while modals decrease
    • so significant change may not mean what you think!
References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL.
• Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics.
• Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: My statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com