
Page 1: Rls For Emnlp 2008

Rion Snow Brendan O’Connor Daniel Jurafsky Andrew Y. Ng

Cheap and Fast - But is it Good? Evaluating Nonexpert Annotations

for Natural Language Tasks

Page 2: Rls For Emnlp 2008

The primacy of data

(Banko and Brill, 2001): Scaling to Very Very Large Corpora for Natural Language Disambiguation

Page 3: Rls For Emnlp 2008

Datasets drive research

Penn Treebank => statistical parsing

Switchboard => speech recognition

PropBank => semantic role labeling

UN Parallel Text => statistical machine translation

Pascal RTE => textual entailment

WordNet / SemCor => word sense disambiguation

Page 4: Rls For Emnlp 2008

The advent of human computation

• Open Mind Common Sense (Singh et al., 2002)

• Games with a Purpose (von Ahn and Dabbish, 2004)

• Online Word Games (Vickrey et al., 2008)

Page 5: Rls For Emnlp 2008

Amazon Mechanical Turk

But what if your task isn’t “fun”?

mturk.com

Page 6: Rls For Emnlp 2008

Using AMT for dataset creation

• Su et al. (2007): name resolution, attribute extraction

• Nakov (2008): paraphrasing noun compounds

• Kaisser and Lowe (2008): sentence-level QA annotation

• Kaisser et al. (2008): customizing QA summary length

• Zaenen (2008): evaluating RTE agreement

Page 7: Rls For Emnlp 2008

Using AMT is cheap

Paper                     Labels    Cents/Label
Su et al. (2007)          10,500    1.5
Nakov (2008)              19,018    unreported
Kaisser and Lowe (2008)   24,321    2.0
Kaisser et al. (2008)     45,300    3.7
Zaenen (2008)              4,000    2.0

Page 8: Rls For Emnlp 2008

And it’s fast...

blog.doloreslabs.com

Page 9: Rls For Emnlp 2008

But is it good?

• Objective: compare nonexpert annotation quality on NLP tasks with gold standard, expert-annotated data

• Method: pick 5 standard datasets, and relabel each point with 10 new annotations

• Compare Turker agreement against each dataset’s reported expert interannotator agreement

Page 10: Rls For Emnlp 2008

Tasks

• Affect recognition (Strapparava and Mihalcea, 2007)
  fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)

• Word similarity (Miller and Charles, 1991)
  sim(boy, lad) > sim(rooster, noon)

• Textual entailment (Dagan et al., 2006)
  if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?

• WSD (Pradhan et al., 2007)
  “a bass on the line” vs. “a funky bass line”

• Temporal annotation (Pustejovsky et al., 2003)
  ran happens before fell in: “The horse ran past the barn fell.”

Page 11: Rls For Emnlp 2008

Tasks

Task                  Expert Labelers   Unique Examples   Interannotator Agreement   Answer Type
Affect Recognition    6                 700               0.603                      numeric
Word Similarity       1                 30                0.958                      numeric
Textual Entailment    1                 800               0.91                       binary
Temporal Annotation   1                 462               unknown                    binary
WSD                   1                 177               unknown                    ternary

Page 12: Rls For Emnlp 2008

Affect Recognition

Page 13: Rls For Emnlp 2008

Interannotator Agreement

• 6 total experts.

• The single-expert ITA is calculated as the average of the Pearson correlations between each annotator’s labels and the average of the other 5 annotators’ labels (a sketch of this calculation follows the table below).

Emotion 1-E ITA

Anger 0.459

Disgust 0.583

Fear 0.711

Joy 0.596

Sadness 0.645

Surprise 0.464

Valence 0.844

All 0.603
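For concreteness, the leave-one-out calculation described above could look like the following minimal numpy sketch (the function name and the annotators-by-items array layout are our assumptions, not from the paper):

```python
import numpy as np

def expert_ita(labels):
    """Leave-one-out ITA: average Pearson correlation between each
    annotator's scores and the mean of the remaining annotators' scores.

    labels: array of shape (n_annotators, n_items) with numeric scores.
    """
    n_annotators = labels.shape[0]
    correlations = []
    for i in range(n_annotators):
        # Mean score of all annotators except annotator i, per item.
        others_mean = np.delete(labels, i, axis=0).mean(axis=0)
        correlations.append(np.corrcoef(labels[i], others_mean)[0, 1])
    return float(np.mean(correlations))
```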

Page 14: Rls For Emnlp 2008

Nonexpert ITA

We average over k annotations to create a single “proto-labeler”.

We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single-expert ITA.
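A rough sketch of that proto-labeler curve (assuming the 10 nonexpert annotations per item can be stacked into one array and that agreement is measured as Pearson correlation against the averaged expert scores; subsets of size k are resampled to average over the choice of annotators):

```python
import numpy as np

def proto_labeler_ita(nonexpert, expert_avg, max_k=10, n_samples=200, seed=0):
    """ITA of an averaged k-annotator "proto-labeler" for k = 1..max_k,
    measured as Pearson correlation against the averaged expert scores.

    nonexpert:  array of shape (10, n_items), the 10 nonexpert scores per item.
    expert_avg: array of shape (n_items,), the averaged expert scores.
    """
    rng = np.random.default_rng(seed)
    n_annotations = nonexpert.shape[0]
    ita_by_k = {}
    for k in range(1, max_k + 1):
        corrs = []
        for _ in range(n_samples):
            idx = rng.choice(n_annotations, size=k, replace=False)
            proto = nonexpert[idx].mean(axis=0)   # average k annotations
            corrs.append(np.corrcoef(proto, expert_avg)[0, 1])
        ita_by_k[k] = float(np.mean(corrs))
    return ita_by_k
```

The smallest k whose ITA reaches the single-expert ITA gives the “number of nonexpert annotators required to match expert ITA” reported later in the deck.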

Page 15: Rls For Emnlp 2008

Interannotator Agreement

Emotion    1-E ITA   10-N ITA
Anger      0.459     0.675
Disgust    0.583     0.746
Fear       0.711     0.689
Joy        0.596     0.632
Sadness    0.645     0.776
Surprise   0.464     0.496
Valence    0.844     0.669
All        0.603     0.694

[Plots: proto-labeler ITA (Pearson correlation) vs. number of nonexpert annotators (2–10) for anger, disgust, fear, joy, sadness, and surprise.]

Number of nonexpert annotators required to match expert ITA, on average: 4

Page 16: Rls For Emnlp 2008

Interannotator Agreement

Task                  1-E ITA   10-N ITA
Affect Recognition    0.603     0.694
Word Similarity       0.958     0.952
Textual Entailment    0.91      0.897
Temporal Annotation   unknown   0.940
WSD                   unknown   0.994

[Plots: proto-labeler agreement vs. number of annotators (2–10) for word similarity (correlation), RTE (accuracy), before/after temporal ordering (accuracy), and WSD (accuracy).]

Page 17: Rls For Emnlp 2008

Error Analysis: WSD

Only 1 “mistake” out of 177 labels:

“The Egyptian president said he would visit Libya today...”

SemEval Task 17 marks this as the “executive officer of a firm” sense, while Turkers voted for the “head of a country” sense.

Page 18: Rls For Emnlp 2008

Error Analysis: RTE

• Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”

• Bob Carpenter’s full analysis available at“Fool’s Gold Standard”, http://lingpipe-blog.com/

Close Examples (~10 disagreements out of 100):

T: “Google files for its long awaited IPO.”

H: “Google goes public.”

Labeled “TRUE” in PASCAL RTE-1,Turkers vote 6-4 “FALSE”.

T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.

H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.

Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

Page 19: Rls For Emnlp 2008

Weighting Annotators

• There are a small number of very prolific, very noisy annotators. If we plot each annotator:

[Plot: per-annotator accuracy vs. number of annotations, Task: RTE.]

• We should be able to do better than majority voting.

Page 20: Rls For Emnlp 2008

Weighting Annotators

• To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set (a sketch of this weighting follows).

• We estimate annotator response from 5% of the gold standard test set, and evaluate with 20-fold CV.
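A minimal sketch of gold-calibrated voting along these lines (assuming a naive-Bayes-style combination in which each annotator w contributes a log-likelihood ratio log P(y_w | x = 1) / P(y_w | x = 0) estimated from the gold items, plus a class prior; the function names, data layout, and add-alpha smoothing are illustrative rather than taken from the paper):

```python
import math
from collections import defaultdict

def fit_annotator_model(gold, responses, alpha=1.0):
    """Estimate P(y_w = y | x) for each annotator w from a small gold set,
    with add-alpha smoothing, plus the class prior log-odds.

    gold:      dict item -> true label in {0, 1}
    responses: iterable of (item, annotator, label) triples with label in {0, 1}
    """
    counts = defaultdict(lambda: [[alpha, alpha], [alpha, alpha]])  # counts[w][x][y]
    for item, w, y in responses:
        if item in gold:
            counts[w][gold[item]][y] += 1
    likelihood = {
        w: [[c[x][y] / sum(c[x]) for y in (0, 1)] for x in (0, 1)]
        for w, c in counts.items()
    }
    n_pos = sum(gold.values())
    prior_logodds = math.log((n_pos + alpha) / (len(gold) - n_pos + alpha))
    return likelihood, prior_logodds

def infer_label(item_responses, likelihood, prior_logodds):
    """Posterior log-odds vote: sum per-annotator log-likelihood ratios."""
    logodds = prior_logodds
    for w, y in item_responses:  # (annotator, label) pairs for one item
        if w in likelihood:
            logodds += math.log(likelihood[w][1][y] / likelihood[w][0][y])
    return 1 if logodds > 0 else 0
```

Each item’s label is then inferred by summing weighted evidence over its annotators rather than counting every vote equally.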

Page 21: Rls For Emnlp 2008

Weighting Annotators

[Plots: accuracy vs. number of annotators for RTE and before/after temporal ordering, comparing gold-calibrated weighting against naive majority voting.]

• Several follow-up posts at http://lingpipe-blog.com

RTE: 4.0% avg. accuracy increase

Temporal: 3.4% avg. accuracy increase

Page 22: Rls For Emnlp 2008

Cost Summary

Task                  Total Labels   Cost (USD)   Time (hours)   Labels/USD   Labels/Hour
Affect Recognition    7000           $2.00        5.93           3500         1180.4
Word Similarity       300            $0.20        0.17           1500         1724.1
Textual Entailment    8000           $8.00        89.3           1000         89.59
Temporal Annotation   4620           $13.86       39.9           333.3        115.85
WSD                   1770           $1.76        8.59           1005.7       206.1
All                   21690          $25.82       143.9          840.0        150.7

Page 23: Rls For Emnlp 2008

In Summary

• All collected data and annotator instructions are available at: http://ai.stanford.edu/~rion/annotations

• Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com

nlp.stanford.edu  ai.stanford.edu  doloreslabs.com

Page 24: Rls For Emnlp 2008

Supplementary Slides

Page 25: Rls For Emnlp 2008

Training systems on nonexpert annotations

• A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation.
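One way such a comparison could be set up (a sketch only; the regressor, features, and variable names here are our assumptions, not the paper’s exact system):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def headline_affect_score(train_texts, train_labels, test_texts, gold_labels):
    """Fit a bag-of-words regressor on one choice of training labels and
    report its Pearson correlation with gold labels on held-out headlines."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    model = Ridge().fit(X_train, train_labels)
    predictions = model.predict(X_test)
    return float(np.corrcoef(predictions, gold_labels)[0, 1])

# The same pipeline is run twice on the same split: once with a single
# expert's labels as train_labels, once with the averaged nonexpert votes,
# and the two held-out correlations are compared.
```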

Page 26: Rls For Emnlp 2008

Where are Turkers?

United States 77.1%
India 5.3%
Philippines 2.8%
Canada 2.8%
UK 1.9%
Germany 0.8%
Italy 0.5%
Netherlands 0.5%
Portugal 0.5%
Australia 0.4%

Remaining 7.3% divided among 78 countries / territories

Analysis by Dolores Labs

Page 27: Rls For Emnlp 2008

Who are Turkers?

“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

[Charts: Gender, Education, Age, Annual income]

Page 28: Rls For Emnlp 2008

Why are Turkers?

“Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

A. To kill time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change / extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen / keep mind sharp
I. Learn English

Page 29: Rls For Emnlp 2008

How much does AMT pay?

“How Much Turking Pays?”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

Page 30: Rls For Emnlp 2008

Annotation Guidelines: Affective Text

Page 31: Rls For Emnlp 2008

Annotation Guidelines: Word Similarity

Page 32: Rls For Emnlp 2008

Annotation Guidelines: Textual Entailment

Page 33: Rls For Emnlp 2008

Annotation Guidelines: Temporal Ordering

Page 34: Rls For Emnlp 2008

Annotation Guidelines: Word Sense Disambiguation

Page 35: Rls For Emnlp 2008

Affect Recognition

We label 100 headlines for each of 7 emotions.

We pay 4 cents for 20 headlines (20 × 7 = 140 labels per task).

Total cost: $2.00

Time to complete: 5.94 hrs

Page 36: Rls For Emnlp 2008

Example Task: Word Similarity

30 word pairs (Rubenstein and Goodenough, 1965)

We pay 10 Turkers 2 cents apiece to score all 30 word pairs.

Total cost: $0.20

Time to complete: 10.4 minutes

Page 37: Rls For Emnlp 2008

Word Similarity ITA

[Plot: correlation vs. number of annotations (2–10).]

Page 38: Rls For Emnlp 2008

• Comparison against multiple annotators

• (graphs)

• avg. number of nonexperts needed to match one expert: 4

Page 39: Rls For Emnlp 2008

Datasets lead the way

WSJ + syntactic annotation = Penn TreeBank => statistical parsing

Brown corpus + sense labeling = Semcor => WSD

TreeBank + role labels = PropBank => SRL

political speeches + translations = United Nations parallel corpora => statistical machine translation

more: RTE, Timebank, ACE/MUC, etc...

Page 40: Rls For Emnlp 2008

Datasets drive research

Penn Treebank => statistical parsing

Switchboard => speech recognition

PropBank => semantic role labeling

UN Parallel Text => statistical machine translation

Enron E-mail Corpus => social network analysis

Pascal RTE => textual entailment

WordNet / SemCor => word sense disambiguation