
Page 1: Rls For Emnlp 2008

Rion Snow Brendan O’Connor Daniel Jurafsky Andrew Y. Ng

Cheap and Fast - But is it Good? Evaluating Nonexpert Annotations

for Natural Language Tasks

Page 2: Rls For Emnlp 2008

The primacy of data

(Banko and Brill, 2001): Scaling to Very Very Large Corpora for Natural Language Disambiguation

Page 3: Rls For Emnlp 2008

Datasets drive research

Penn Treebank => statistical parsing

Switchboard => speech recognition

PropBank => semantic role labeling

UN Parallel Text => statistical machine translation

Pascal RTE => textual entailment

WordNet / SemCor => word sense disambiguation

Page 4: Rls For Emnlp 2008

The advent of human computation

• Open Mind Common Sense (Singh et al., 2002)

• Games with a Purpose (von Ahn and Dabbish, 2004)

• Online Word Games (Vickrey et al., 2008)

Page 5: Rls For Emnlp 2008

Amazon Mechanical Turk

But what if your task isn’t “fun”?

mturk.com

Page 6: Rls For Emnlp 2008

Using AMT for dataset creation

• Su et al. (2007): name resolution, attribute extraction

• Nakov (2008): paraphrasing noun compounds

• Kaisser and Lowe (2008): sentence-level QA annotation

• Kaisser et al. (2008): customizing QA summary length

• Zaenen (2008): evaluating RTE agreement

Page 7: Rls For Emnlp 2008

Using AMT is cheap

Paper                     Labels    Cents/Label
Su et al. (2007)          10,500    1.5
Nakov (2008)              19,018    unreported
Kaisser and Lowe (2008)   24,321    2.0
Kaisser et al. (2008)     45,300    3.7
Zaenen (2008)              4,000    2.0

Page 8: Rls For Emnlp 2008

And it’s fast...

blog.doloreslabs.com

Page 9: Rls For Emnlp 2008

But is it good?

• Objective: compare nonexpert annotation quality on NLP tasks with gold standard, expert-annotated data

• Method: pick 5 standard datasets, and relabel each point with 10 new annotations

• Compare Turker agreement against each dataset’s reported expert interannotator agreement

Page 10: Rls For Emnlp 2008

Tasks

• Affect recognition (Strapparava and Mihalcea, 2007)
  fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)

• Word similarity (Miller and Charles, 1991)
  sim(boy, lad) > sim(rooster, noon)

• Textual entailment (Dagan et al., 2006)
  if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?

• WSD (Pradhan et al., 2007)
  “a bass on the line” vs. “a funky bass line”

• Temporal annotation (Pustejovsky et al., 2003)
  ran happens before fell in: “The horse ran past the barn fell.”

Page 11: Rls For Emnlp 2008

Tasks

Task                  Expert Labelers   Unique Examples   Interannotator Agreement   Answer Type
Affect Recognition    6                 700               0.603                      numeric
Word Similarity       1                 30                0.958                      numeric
Textual Entailment    1                 800               0.91                       binary
Temporal Annotation   1                 462               unknown                    binary
WSD                   1                 177               unknown                    ternary

Page 12: Rls For Emnlp 2008

Affect Recognition

Page 13: Rls For Emnlp 2008

Interannotator Agreement

• 6 total experts.

• The single-expert ITA is calculated as the average of the Pearson correlations between each annotator’s labels and the average of the other 5 annotators’ labels (a sketch of this calculation follows the table below).

Emotion 1-E ITA

Anger 0.459

Disgust 0.583

Fear 0.711

Joy 0.596

Sadness 0.645

Surprise 0.464

Valence 0.844

All 0.603
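For concreteness, the leave-one-out calculation described above could look like the following minimal numpy sketch (the function name and the annotators-by-items array layout are our assumptions, not from the paper):

```python
import numpy as np

def expert_ita(labels):
    """Leave-one-out ITA: average Pearson correlation between each
    annotator's scores and the mean of the remaining annotators' scores.

    labels: array of shape (n_annotators, n_items) with numeric scores.
    """
    n_annotators = labels.shape[0]
    correlations = []
    for i in range(n_annotators):
        # Mean score of all annotators except annotator i, per item.
        others_mean = np.delete(labels, i, axis=0).mean(axis=0)
        correlations.append(np.corrcoef(labels[i], others_mean)[0, 1])
    return float(np.mean(correlations))
```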

Page 14: Rls For Emnlp 2008

Nonexpert ITA

We average over k annotations to create a single “proto-labeler”.

We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single-expert ITA.
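A rough sketch of that proto-labeler curve (assuming the 10 nonexpert annotations per item can be stacked into one array and that agreement is measured as Pearson correlation against the averaged expert scores; subsets of size k are resampled to average over the choice of annotators):

```python
import numpy as np

def proto_labeler_ita(nonexpert, expert_avg, max_k=10, n_samples=200, seed=0):
    """ITA of an averaged k-annotator "proto-labeler" for k = 1..max_k,
    measured as Pearson correlation against the averaged expert scores.

    nonexpert:  array of shape (10, n_items), the 10 nonexpert scores per item.
    expert_avg: array of shape (n_items,), the averaged expert scores.
    """
    rng = np.random.default_rng(seed)
    n_annotations = nonexpert.shape[0]
    ita_by_k = {}
    for k in range(1, max_k + 1):
        corrs = []
        for _ in range(n_samples):
            idx = rng.choice(n_annotations, size=k, replace=False)
            proto = nonexpert[idx].mean(axis=0)   # average k annotations
            corrs.append(np.corrcoef(proto, expert_avg)[0, 1])
        ita_by_k[k] = float(np.mean(corrs))
    return ita_by_k
```

The smallest k whose ITA reaches the single-expert ITA gives the “number of nonexpert annotators required to match expert ITA” reported later in the deck.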

Page 15: Rls For Emnlp 2008

Interannotator Agreement

Emotion    1-E ITA   10-N ITA
Anger      0.459     0.675
Disgust    0.583     0.746
Fear       0.711     0.689
Joy        0.596     0.632
Sadness    0.645     0.776
Surprise   0.464     0.496
Valence    0.844     0.669
All        0.603     0.694

[Plots: proto-labeler ITA (Pearson correlation) vs. number of nonexpert annotators (2–10) for anger, disgust, fear, joy, sadness, and surprise.]

Number of nonexpert annotators required to match expert ITA, on average: 4

Page 16: Rls For Emnlp 2008

Interannotator Agreement

Task                  1-E ITA   10-N ITA
Affect Recognition    0.603     0.694
Word Similarity       0.958     0.952
Textual Entailment    0.91      0.897
Temporal Annotation   unknown   0.940
WSD                   unknown   0.994

[Plots: proto-labeler agreement vs. number of annotators (2–10) for word similarity (correlation), RTE (accuracy), before/after temporal ordering (accuracy), and WSD (accuracy).]

Page 17: Rls For Emnlp 2008

Error Analysis: WSD

Only 1 “mistake” out of 177 labels:

“The Egyptian president said he would visit Libya today...”

SemEval Task 17 marks this as the “executive officer of a firm” sense, while Turkers voted for the “head of a country” sense.

Page 18: Rls For Emnlp 2008

Error Analysis: RTE

• Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”

• Bob Carpenter’s full analysis available at“Fool’s Gold Standard”, http://lingpipe-blog.com/

Close Examples (~10 disagreements out of 100):

T: “Google files for its long awaited IPO.”

H: “Google goes public.”

Labeled “TRUE” in PASCAL RTE-1,Turkers vote 6-4 “FALSE”.

T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.

H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.

Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

Page 19: Rls For Emnlp 2008

Weighting Annotators

• There are a small number of very prolific, very noisy annotators. If we plot each annotator:

[Plot: per-annotator accuracy vs. number of annotations, Task: RTE.]

• We should be able to do better than majority voting.

Page 20: Rls For Emnlp 2008

Weighting Annotators

• To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set (a sketch of this weighting follows).

• We estimate annotator response from 5% of the gold standard test set, and evaluate with 20-fold CV.
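A minimal sketch of gold-calibrated voting along these lines (assuming a naive-Bayes-style combination in which each annotator w contributes a log-likelihood ratio log P(y_w | x = 1) / P(y_w | x = 0) estimated from the gold items, plus a class prior; the function names, data layout, and add-alpha smoothing are illustrative rather than taken from the paper):

```python
import math
from collections import defaultdict

def fit_annotator_model(gold, responses, alpha=1.0):
    """Estimate P(y_w = y | x) for each annotator w from a small gold set,
    with add-alpha smoothing, plus the class prior log-odds.

    gold:      dict item -> true label in {0, 1}
    responses: iterable of (item, annotator, label) triples with label in {0, 1}
    """
    counts = defaultdict(lambda: [[alpha, alpha], [alpha, alpha]])  # counts[w][x][y]
    for item, w, y in responses:
        if item in gold:
            counts[w][gold[item]][y] += 1
    likelihood = {
        w: [[c[x][y] / sum(c[x]) for y in (0, 1)] for x in (0, 1)]
        for w, c in counts.items()
    }
    n_pos = sum(gold.values())
    prior_logodds = math.log((n_pos + alpha) / (len(gold) - n_pos + alpha))
    return likelihood, prior_logodds

def infer_label(item_responses, likelihood, prior_logodds):
    """Posterior log-odds vote: sum per-annotator log-likelihood ratios."""
    logodds = prior_logodds
    for w, y in item_responses:  # (annotator, label) pairs for one item
        if w in likelihood:
            logodds += math.log(likelihood[w][1][y] / likelihood[w][0][y])
    return 1 if logodds > 0 else 0
```

Each item’s label is then inferred by summing weighted evidence over its annotators rather than counting every vote equally.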

Page 21: Rls For Emnlp 2008

Weighting Annotators

[Plots: accuracy vs. number of annotators for RTE and before/after temporal ordering, comparing gold-calibrated weighting against naive majority voting.]

• Several follow-up posts at http://lingpipe-blog.com

RTE: 4.0% avg. accuracy increase

Temporal: 3.4% avg. accuracy increase

Page 22: Rls For Emnlp 2008

Cost Summary

Task                  Total Labels   Cost (USD)   Time (hours)   Labels/USD   Labels/Hour
Affect Recognition    7000           $2.00        5.93           3500         1180.4
Word Similarity       300            $0.20        0.17           1500         1724.1
Textual Entailment    8000           $8.00        89.3           1000         89.59
Temporal Annotation   4620           $13.86       39.9           333.3        115.85
WSD                   1770           $1.76        8.59           1005.7       206.1
All                   21690          $25.82       143.9          840.0        150.7

Page 23: Rls For Emnlp 2008

In Summary

• All collected data and annotator instructions are available at: http://ai.stanford.edu/~rion/annotations

• Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com

nlp.stanford.edu  ai.stanford.edu  doloreslabs.com

Page 24: Rls For Emnlp 2008

Supplementary Slides

Page 25: Rls For Emnlp 2008

Training systems on nonexpert annotations

• A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation.
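One way such a comparison could be set up (a sketch only; the regressor, features, and variable names here are our assumptions, not the paper’s exact system):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def headline_affect_score(train_texts, train_labels, test_texts, gold_labels):
    """Fit a bag-of-words regressor on one choice of training labels and
    report its Pearson correlation with gold labels on held-out headlines."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    model = Ridge().fit(X_train, train_labels)
    predictions = model.predict(X_test)
    return float(np.corrcoef(predictions, gold_labels)[0, 1])

# The same pipeline is run twice on the same split: once with a single
# expert's labels as train_labels, once with the averaged nonexpert votes,
# and the two held-out correlations are compared.
```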

Page 26: Rls For Emnlp 2008

Where are Turkers?

United States 77.1%
India 5.3%
Philippines 2.8%
Canada 2.8%
UK 1.9%
Germany 0.8%
Italy 0.5%
Netherlands 0.5%
Portugal 0.5%
Australia 0.4%

Remaining 7.3% divided among 78 countries / territories

Analysis by Dolores Labs

Page 27: Rls For Emnlp 2008

Who are Turkers?

“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

[Charts: Gender, Education, Age, Annual income]

Page 28: Rls For Emnlp 2008

Why are Turkers?

“Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

A. To kill time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change / extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen / keep mind sharp
I. Learn English

Page 29: Rls For Emnlp 2008

How much does AMT pay?

“How Much Turking Pays?”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

Page 30: Rls For Emnlp 2008

Annotation Guidelines: Affective Text

Page 31: Rls For Emnlp 2008

Annotation Guidelines: Word Similarity

Page 32: Rls For Emnlp 2008

Annotation Guidelines: Textual Entailment

Page 33: Rls For Emnlp 2008

Annotation Guidelines: Temporal Ordering

Page 34: Rls For Emnlp 2008

Annotation Guidelines: Word Sense Disambiguation

Page 35: Rls For Emnlp 2008

Affect Recognition

We label 100 headlines for each of 7 emotions.

We pay 4 cents for 20 headlines (20 × 7 = 140 labels per task).

Total cost: $2.00

Time to complete: 5.94 hrs

Page 36: Rls For Emnlp 2008

Example Task: Word Similarity

30 word pairs (Rubenstein and Goodenough, 1965)

We pay 10 Turkers 2 cents apiece to score all 30 word pairs.

Total cost: $0.20

Time to complete: 10.4 minutes

Page 37: Rls For Emnlp 2008

Word Similarity ITA

[Plot: correlation vs. number of annotations (2–10).]

Page 38: Rls For Emnlp 2008

• Comparison against multiple annotators

• (graphs)

• avg. number of nonexperts needed to match one expert: 4

Page 39: Rls For Emnlp 2008

Datasets lead the way

WSJ + syntactic annotation = Penn TreeBank => statistical parsing

Brown corpus + sense labeling = Semcor => WSD

TreeBank + role labels = PropBank => SRL

political speeches + translations = United Nations parallel corpora => statistical machine translation

more: RTE, Timebank, ACE/MUC, etc...

Page 40: Rls For Emnlp 2008

Datasets drive research

Penn Treebank => statistical parsing

Switchboard => speech recognition

PropBank => semantic role labeling

UN Parallel Text => statistical machine translation

Enron E-mail Corpus => social network analysis

Pascal RTE => textual entailment

WordNet / SemCor => word sense disambiguation