stefan evert, ims - uni stuttgart brigitte krenn, Öfai wien ims the significance of result...
TRANSCRIPT
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
The Significance of Result Differences
Why Significance Tests?
• everybody knows we have to test the significance of our results
• but do we really?
• evaluation results are valid for
  • data from specific corpus
  • extracted with specific methods
  • for a particular type of collocations
  • according to the intuitions of one particular annotator (or two)
Why Significance Tests?
• significance tests are about generalisations
• basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?"
• influence of source corpus, domain, collocation type and definition, annotation guidelines, ...
Evaluation of Association Measures
A Different Perspective
• pair types are described by tables (O11, O12, O21, O22) → coordinates in 4-D space
• O22 is redundant because O11 + O12 + O21 + O22 = N
• can also describe pair type by joint and marginal frequencies (f, f1, f2) → coordinates in 3-D space
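The correspondence between the two descriptions can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual convention that f1 = O11 + O12 and f2 = O11 + O21 (row and column sums); the function names are hypothetical and not taken from the slides.

```python
def contingency_table(f, f1, f2, n):
    """Rebuild (O11, O12, O21, O22) from joint frequency f,
    marginal frequencies f1, f2 and sample size n.
    Assumes f1 = O11 + O12 and f2 = O11 + O21."""
    o11 = f
    o12 = f1 - f
    o21 = f2 - f
    o22 = n - f1 - f2 + f  # the redundant cell: fixed by N and the other three
    return o11, o12, o21, o22


def frequency_signature(o11, o12, o21, o22):
    """Inverse direction: (f, f1, f2, N) from the four observed cells."""
    return o11, o11 + o12, o11 + o21, o11 + o12 + o21 + o22
```

Because N is the same for every pair type in a data set, the three values (f, f1, f2) carry the same information as the full table.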
A Different Perspective
• data set = cloud of points in three-dimensional space
• visualisation is "challenging"
• many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial)
• projection to (O11, E11) → coordinates in 2-D space (ignoring the ratio f1 / f2)
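As an illustration of measures that depend on O11 and E11 only, here is a small Python sketch of MI and t-score. The formulas are the commonly used textbook definitions (E11 = f1 · f2 / N, MI = log2(O11 / E11), t-score = (O11 − E11) / √O11) and are an assumption here, since the slides do not spell them out; in particular the base of the logarithm is a choice.

```python
import math


def expected_frequency(f1, f2, n):
    """Expected joint frequency E11 under independence."""
    return f1 * f2 / n


def mi(o11, e11):
    """Mutual information score (base-2 logarithm assumed)."""
    return math.log2(o11 / e11)


def t_score(o11, e11):
    """t-score in its usual simplified form."""
    return (o11 - e11) / math.sqrt(o11)
```

Both functions take a point (O11, E11) of the 2-D projection as their only input, which is what makes the projection sufficient for these measures.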
The Parameter Space of Collocation Candidates
N-best Lists in Parameter Space
• N-best list for an AM includes all pair types where score ≥ c (threshold c obtained from data)
• {score ≥ c} describes a subset of the parameter space
• for a sound association measure, the isoline {score = c} is the lower boundary (because scores should increase with O11 for a fixed value of E11)
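The way the threshold c is obtained from the data can be sketched as follows. This is a hypothetical helper, assuming scores are held in a dictionary from candidate to score and that n is at least 1; with tied scores at the threshold the selected set can be larger than n.

```python
def n_best(scores, n):
    """Return (c, selected): c is the score of the n-th ranked candidate,
    selected holds every candidate whose score is >= c."""
    ranked = sorted(scores.values(), reverse=True)
    c = ranked[min(n, len(ranked)) - 1]  # threshold read off the data
    selected = [cand for cand, s in scores.items() if s >= c]
    return c, selected
```

The selected candidates are exactly the points of the parameter space lying on or above the isoline for c.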
N-Best Isolines in the Parameter Space
MI
N-Best Isolines in the Parameter Space
t-score
95% Confidence Interval
99% Confidence Interval
95% Confidence Interval
Comparing Precision Values
• number of TPs and FPs for 1000-best lists

tbl     t-score   frequency
TPs     322       283
FPs     678       717
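The precision values being compared follow directly from the table. A minimal Python sketch (the helper name is hypothetical):

```python
def precision(tp, fp):
    """Proportion of true positives among all candidates in the list."""
    return tp / (tp + fp)


t_score_precision = precision(322, 678)    # 1000-best list ranked by t-score
frequency_precision = precision(283, 717)  # 1000-best list ranked by frequency
difference = t_score_precision - frequency_precision
```

This gives 32.2% for t-score against 28.3% for frequency, a difference of 3.9 percentage points.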
McNemar's Test
+ = in 1000-best list, – = not in 1000-best list
• ideally: all TPs in 1000-best list (possible!)
• H0: differences between AMs are random

tbl       – t-score   + t-score
– freq    610         46
+ freq    7           276
McNemar's Test
+ = in 1000-best list, – = not in 1000-best list
> mcnemar.test(tbl)
• p-value < 0.001 → highly significant

tbl       – t-score   + t-score
– freq    610         46
+ freq    7           276
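For readers without R at hand, the computation behind the call above can be sketched in plain Python. This assumes the chi-squared form of McNemar's test with continuity correction, which is what R's `mcnemar.test` applies by default; the function name `mcnemar_chi2` is hypothetical. Only the two discordant cells (46 and 7) enter the statistic.

```python
import math


def mcnemar_chi2(b, c, correction=True):
    """McNemar's test on the discordant counts b and c.
    Returns (statistic, p_value) from the chi-squared approximation, 1 df."""
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)  # continuity correction
    statistic = diff ** 2 / (b + c)
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# 46 TPs found only by t-score, 7 TPs found only by frequency
statistic, p_value = mcnemar_chi2(46, 7)
```

The statistic is (|46 − 7| − 1)² / (46 + 7) = 1444 / 53, about 27.2, and the resulting p-value lies well below 0.001, in line with the slide.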
Significant Differences
(figure legend: significant; relevant (2%))
Lowest-Frequency Data: Samples
• Too much data for full manual evaluation → random samples
• AdjN data
  • 965 pairs with f = 1 (15% sample)
  • manually identified 31 TPs (3.2%)
• PNV data
  • 983 pairs with f < 3 (0.35% sample)
  • manually identified 6 TPs (0.6%)
Lowest-Frequency Data: Samples
• Estimate proportion p of TPs among all lowest-frequency data
• Confidence set from binomial test
• AdjN: 31 TPs among 965 items
  • p ≤ 5% with 99% confidence
  • at most 320 TPs
• PNV: 6 TPs among 983 items
  • p ≤ 1.5% with 99% confidence
  • there might still be 4200 TPs!!
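An upper confidence bound of this kind can be sketched by inverting the exact binomial test. The code below assumes a one-sided bound (the largest p for which observing at most k TPs still has probability above 1 − confidence); the slides do not say whether a one-sided or two-sided set was used, and the function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(k, n, confidence=0.99):
    """One-sided exact upper bound for p, found by bisection (assumes k < n)."""
    alpha = 1.0 - confidence
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # mid is still compatible with the data
        else:
            hi = mid
    return hi


adjn_bound = upper_bound(31, 965)  # AdjN sample
pnv_bound = upper_bound(6, 983)    # PNV sample
```

Under these assumptions both bounds stay below the rounded values on the slide (5% and 1.5%). Multiplying a bound by the size of the full lowest-frequency data set (the sample size divided by the sampling fraction) gives the corresponding maximum number of TPs.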
N-best Lists for Lowest-Frequency Data
• evaluate 10,000-best lists
  • to reduce manual annotation work, take 10% sample from each list (i.e. 1,000 candidates for each AM)
• precision graphs for N-best lists
  • up to N = 10,000 for the PNV data
• 95% confidence estimates for precision of best-performing AM (from binomial test)
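The sampling step can be sketched as follows. This assumes simple random sampling without replacement from the ranked list, which the slides do not state explicitly; the function names, the seed and the `is_tp` annotation callback are all hypothetical.

```python
import random


def sample_ranked_list(ranked, fraction=0.1, seed=0):
    """Draw a random sample from a ranked candidate list.
    Returns (rank, candidate) pairs in rank order; ranks start at 1."""
    rng = random.Random(seed)
    k = round(len(ranked) * fraction)
    positions = sorted(rng.sample(range(len(ranked)), k))
    return [(pos + 1, ranked[pos]) for pos in positions]


def estimated_precision(sample, is_tp, n):
    """Estimate the precision of the n-best list from the sampled
    candidates with rank <= n; None if no sampled item falls in range."""
    judged = [cand for rank, cand in sample if rank <= n]
    if not judged:
        return None
    return sum(1 for cand in judged if is_tp(cand)) / len(judged)
```

Only the sampled candidates need manual annotation, and the estimate for each N uses just the annotated items that fall inside the N-best list, so estimates for small N rest on few items and are correspondingly uncertain.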
Random Sample Evaluation