The Significance of Result Differences

Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien

Why Significance Tests?

• everybody knows we have to test the significance of our results

• but do we really?

• evaluation results are valid for
  • data from a specific corpus
  • extracted with specific methods
  • for a particular type of collocations
  • according to the intuitions of one particular annotator (or two)

• significance tests are about generalisations

• basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?"

• influence of source corpus, domain, collocation type and definition, annotation guidelines, ...

Evaluation of Association Measures

A Different Perspective

• pair types are described by contingency tables (O11, O12, O21, O22) → coordinates in 4-D space

• O22 is redundant because O11 + O12 + O21 + O22 = N

• can also describe a pair type by its joint and marginal frequencies (f, f1, f2) → coordinates in 3-D space

• data set = cloud of points in three-dimensional space

• visualisation is "challenging"

• many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial)

• projection to (O11, E11) → coordinates in 2-D space (ignoring the ratio f1 / f2)
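
The coordinate changes above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, E11 = f1 · f2 / N is the usual expected co-occurrence frequency under independence, and the MI and t-score formulas are the common textbook definitions (the slides do not spell them out, and the base-2 logarithm for MI is an assumption).

```python
import math

def contingency_table(f, f1, f2, n):
    """Convert (f, f1, f2) and sample size N into (O11, O12, O21, O22).

    O22 follows from the other cells because all four sum to N.
    """
    o11 = f
    o12 = f1 - f
    o21 = f2 - f
    o22 = n - o11 - o12 - o21
    return o11, o12, o21, o22

def expected_frequency(f1, f2, n):
    """E11 under independence (standard definition, assumed here)."""
    return f1 * f2 / n

def mi(o11, e11):
    """Mutual information score, assumed as log2(O11 / E11)."""
    return math.log2(o11 / e11)

def t_score(o11, e11):
    """t-score, assumed as (O11 - E11) / sqrt(O11)."""
    return (o11 - e11) / math.sqrt(o11)

# Made-up pair type: f = 30, f1 = 100, f2 = 200 in a sample of N = 10,000
table = contingency_table(30, 100, 200, 10_000)
e11 = expected_frequency(100, 200, 10_000)   # 2.0
point_2d = (table[0], e11)                   # projection to (O11, E11)
scores = {"MI": mi(*point_2d), "t-score": t_score(*point_2d)}
```

Both scores are computed from the projected point alone, which is why the 2-D projection loses nothing for these measures.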

The Parameter Space of Collocation Candidates

N-best Lists in Parameter Space

• N-best list for an AM includes all pair types where score ≥ c (threshold c obtained from data)

• {score ≥ c} describes a subset of the parameter space

• for a sound association measure, the isoline {score = c} is the lower boundary (because scores should increase with O11 for a fixed value of E11)
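
As a rough sketch of how such isolines can be computed, the Python below solves {score = c} for O11 at a given E11. It assumes the common definitions MI = log2(O11 / E11) and t-score = (O11 - E11) / sqrt(O11), which the slides do not state explicitly; all function names are hypothetical.

```python
import math

def n_best_threshold(scores, n):
    """Threshold c obtained from data: the score of the n-th best candidate."""
    return sorted(scores, reverse=True)[n - 1]

def mi_isoline(e11, c):
    """O11 on the isoline {MI = c}: log2(O11 / E11) = c  =>  O11 = E11 * 2^c."""
    return e11 * 2 ** c

def t_score_isoline(e11, c):
    """O11 on the isoline {t-score = c}.

    (O11 - E11) / sqrt(O11) = c is a quadratic in x = sqrt(O11):
    x^2 - c*x - E11 = 0, whose positive root is taken.
    """
    x = (c + math.sqrt(c * c + 4 * e11)) / 2
    return x * x

def t_score(o11, e11):
    return (o11 - e11) / math.sqrt(o11)

# Points on or above the isoline belong to the N-best list.
c = n_best_threshold([5.1, 0.3, 3.0, 2.2, 4.4], n=3)   # 3.0
boundary = [(e, t_score_isoline(e, c)) for e in (0.5, 2.0, 8.0)]
```

For MI the isoline is a straight line through the origin of the (E11, O11) plane, while for the t-score it curves upward, so the two measures select differently shaped regions of the parameter space.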

N-Best Isolines in the Parameter Space

MI

N-Best Isolines in the Parameter Space

t-score

95% Confidence Interval

Comparing Precision Values

• number of TPs and FPs for 1000-best lists

tbl      t-score   frequency
TPs        322        283
FPs        678        717

McNemar's Test

+ = in 1000-best list, – = not in 1000-best list

• ideally: all TPs in 1000-best list (possible!)

• H0: differences between AMs are random

tbl       – t-score   + t-score
– freq       610          46
+ freq         7         276

> mcnemar.test(tbl)

• p-value < 0.001 → highly significant
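
For readers without R at hand, the test can be reproduced with a short Python sketch. It uses only the two discordant cells of the table (46 TPs found by t-score but not by frequency, 7 the other way round) and the chi-squared approximation with one degree of freedom. The continuity correction mirrors the default of R's `mcnemar.test`; the function name and interface here are hypothetical.

```python
import math

def mcnemar_test(b, c, correct=True):
    """McNemar's chi-squared test from the two discordant cell counts.

    Returns (statistic, p_value). With one degree of freedom, the upper
    tail of the chi-squared distribution is erfc(sqrt(x / 2)).
    """
    diff = abs(b - c)
    if correct:
        diff -= 1  # continuity correction
    statistic = diff ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Discordant cells from the table: (– freq, + t-score) = 46, (+ freq, – t-score) = 7
statistic, p_value = mcnemar_test(46, 7)
highly_significant = p_value < 0.001
```

The concordant cells (610 and 276) do not enter the statistic: only the TPs on which the two measures disagree carry information about a difference between them.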

Significant Differences

Legend: significant differences vs. relevant differences (2%)

Lowest-Frequency Data: Samples

• Too much data for full manual evaluation → random samples

• AdjN data
  • 965 pairs with f = 1 (15% sample)
  • manually identified 31 TPs (3.2%)

• PNV data
  • 983 pairs with f < 3 (0.35% sample)
  • manually identified 6 TPs (0.6%)

• Estimate proportion p of TPs among all lowest-frequency data

• Confidence set from binomial test

• AdjN: 31 TPs among 965 items
  • p ≤ 5% with 99% confidence
  • at most 320 TPs

• PNV: 6 TPs among 983 items
  • p ≤ 1.5% with 99% confidence
  • there might still be 4200 TPs!
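
One way to obtain such bounds is to invert the binomial test numerically. The sketch below assumes a one-sided (Clopper-Pearson style) upper bound; the slides do not say whether the confidence set was one- or two-sided, so this is an illustration rather than the authors' exact procedure, and the function names are hypothetical. It requires Python 3.8+ for `math.comb`.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def upper_bound(k, n, confidence=0.99, tol=1e-9):
    """Largest proportion p that a one-sided binomial test does not reject.

    Finds p with P(X <= k | n, p) = 1 - confidence by bisection; the CDF
    decreases monotonically in p, so the root is unique.
    """
    alpha = 1 - confidence
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

adjn_bound = upper_bound(31, 965)   # AdjN sample: 31 TPs among 965 items
pnv_bound = upper_bound(6, 983)     # PNV sample: 6 TPs among 983 items

# Scale the rounded bounds from the slide up to the full data sets,
# using the stated sampling rates (15% and 0.35%).
adjn_max_tps = 0.05 * 965 / 0.15
pnv_max_tps = 0.015 * 983 / 0.0035
```

Under these assumptions both computed bounds fall just below the rounded 5% and 1.5% figures, and scaling by the sampling rates gives counts of the same order as the 320 and 4200 TPs quoted above.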

N-best Lists for Lowest-Frequency Data

• evaluate 10,000-best lists

• to reduce manual annotation work, take a 10% sample from each list (i.e. 1,000 candidates for each AM)

• precision graphs for N-best lists
  • up to N = 10,000 for the PNV data

• 95% confidence estimates for precision of the best-performing AM (from binomial test)
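
The sampling step might look roughly like the following Python sketch. It draws a 10% random sample of rank positions from a ranked candidate list and estimates the precision of each N-best prefix from the sampled items that fall inside it. The function name, the `is_tp` callback (standing in for the manual annotation) and the toy data are all hypothetical.

```python
import random

def sampled_precision_curve(ranked, is_tp, sample_rate=0.10, step=1000, seed=0):
    """Estimate N-best precision from a random sample of a ranked list.

    ranked: candidates sorted by decreasing association score.
    is_tp:  callable standing in for the manual annotation (hypothetical).
    Returns a list of (N, estimated precision) pairs.
    """
    rng = random.Random(seed)
    sample_size = round(len(ranked) * sample_rate)
    # Sample rank positions so that every sampled item keeps its rank.
    sampled_ranks = sorted(rng.sample(range(len(ranked)), sample_size))
    curve = []
    for n in range(step, len(ranked) + 1, step):
        in_prefix = [r for r in sampled_ranks if r < n]
        if not in_prefix:
            continue  # no annotated items in this prefix
        tps = sum(1 for r in in_prefix if is_tp(ranked[r]))
        curve.append((n, tps / len(in_prefix)))
    return curve

# Toy 10,000-best list in which the top 2,000 candidates are "true positives"
toy_list = list(range(10_000))
toy_curve = sampled_precision_curve(toy_list, lambda cand: cand < 2000)
```

Only the 1,000 sampled candidates need manual annotation, at the price of sampling error in each precision estimate, which is what the confidence estimates are meant to quantify.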

Random Sample Evaluation
