CROWDSOURCING (ANAPHORIC) ANNOTATION
Massimo Poesio, University of Essex
Part 1: Intro, Microtask crowdsourcing
ANNOTATED CORPORA: AN ERA OF PLENTY?
• With the release of the OntoNotes and ANC corpora for English, and of a number of corpora for other languages (the Prague Dependency Treebank especially), one might think no more annotation will be needed for a while
• But this is not the case:
– Size is still an issue
– The existing annotations have problems
– Novel tasks require new annotation
THE CASE OF ANAPHORIC ANNOTATION
• After many years of scarcity we now live in an era of abundance to study anaphora
• The release after 2008 of a number of anaphoric corpora annotated according to more linguistically inspired guidelines has enabled computational linguists to start working on a version of the problem that more closely resembles the way linguists look at it
– Also, one hears less of coreference and more of anaphora
• Recent competitions all use corpora of this type:– SEMEVAL 2010, CONLL 2011 / 2012, EVALITA 2011, …
BUT …
• The larger corpora:
– Not that large: still only 1M words at the most (see the problems of overfitting with the Penn Treebank)
– Cover only a few genres (mostly news)
– Annotation schemes are still pretty basic
ONTONOTES: OUT OF SCOPE
There was not a moment to be lost: away went Alice like the wind, and was just in time to hear it say, as it turned a corner, 'Oh my ears and whiskers, how late it's getting!' She was close behind it when she turned the corner, but the Rabbit was no longer to be seen: she found herself in a long, low hall, which was lit up by a row of lamps hanging from the roof.
There were doors all round the hall, but they were all locked; and when Alice had been all the way down one side and up the other, trying every door, she walked sadly down the middle, wondering how she was ever to get out again.
ONTONOTES: OUT OF SCOPE
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
ALSO …
• In many respects the doubts concerning the empirical approach to anaphora raised by Zaenen in her 2006 CL squib (Markup barking up the wrong tree) still apply
• “anaphora is not like syntax – we don’t understand the phenomenon quite as well”
• Even the simpler types of anaphoric annotation are still problematic– If not for linguists, for computational types
• And we only started grappling with the more complex types of anaphora
She could hear the rattle of the teacups as [[the March Hare] and [his friends]] shared their never-ending meal
PLURALS WITH COORDINATED ANTECEDENTS: SPLIT ANTECEDENTS (GNOME, ARRAU, SERENGETI, TüBa-D/Z?)
'In THAT direction,' the Cat said, waving its right paw round, 'lives [a Hatter]: and in THAT direction,' waving the other paw, 'lives [a March Hare]. Visit either you like: they're both mad.'
Alice had no idea what [Latitude] was, or [Longitude] either, but thought they were nice grand words to say.
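Examples like these have a practical consequence for annotation formats: an anaphor must be allowed to point to a set of antecedent markables, not just one. A minimal sketch of such a representation (the class and field names are illustrative, not any existing corpus's scheme):

```python
# Split antecedents: a plural pronoun ("they") can refer back to a SET of
# markables rather than a single one. All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Markable:
    id: str
    text: str

@dataclass
class AnaphoricLink:
    anaphor: Markable
    antecedents: frozenset  # a set, so split antecedents are representable

hatter = Markable("m1", "a Hatter")
hare = Markable("m2", "a March Hare")
they = Markable("m3", "they")

link = AnaphoricLink(anaphor=they, antecedents=frozenset({hatter, hare}))
print(len(link.antecedents))  # 2 — the plural pronoun has two antecedents
```

A single-antecedent scheme would have to drop or distort one of the two links; making the antecedent slot a set keeps both.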
15.12 M: we’re gonna take the engine E3
15.13 : and shove it over to Corning
15.14 : hook [it] up to [the tanker car]
15.15 : _and_
15.16 : send it back to Elmira
(from the TRAINS-91 dialogues collected at the University of Rochester)
AMBIGUITY: REFERENT
www.phrasedetectives.com
About 160 workers at a factory that made paper for the Kent filters were exposed to asbestos in the 1950s.
Areas of the factory were particularly dusty where the crocidolite was used.
Workers dumped large burlap sacks of the imported material into a huge bin, poured in cotton and acetate fibers and mechanically mixed the dry fibers in a process used to make filters.
Workers described "clouds of blue dust" that hung over parts of the factory,
even though exhaust fans ventilated the area.
AMBIGUITY: REFERENT
AMBIGUITY: EXPLETIVES
'I beg your pardon!' said the Mouse, frowning, but very politely: 'Did you speak?'
'Not I!' said the Lory hastily.
'I thought you did,' said the Mouse. '--I proceed. "Edwin and Morcar, the earls of Mercia and Northumbria, declared for him: and even Stigand, the patriotic archbishop of Canterbury, found it advisable--"'
'Found WHAT?' said the Duck.
'Found IT,' the Mouse replied rather crossly: 'of course you know what "it" means.'
AND ANYWAY …
• A great deal, if not most, research in (computational) linguistics requires annotation of new types:
– A new syntactic phenomenon
– Named entities in a new domain
– A type of anaphoric phenomenon not yet annotated (or annotated badly)
OUR CONTENTION
• CROWDSOURCING can be (part of) the solution to these problems:
– Microtask crowdsourcing for small-to-medium scale annotation projects
• Including the majority of annotation carried out by linguists / psycholinguists (see e.g. Munro et al 2010)
– Games-with-a-purpose for larger scale projects
– Either way, to gather more evidence about the linguistic phenomena
OUTLINE OF THE LECTURES
• Crowdsourcing in AI and NLP (today)
– What is crowdsourcing; applications to NLP
• Games with a purpose
– ESP, Verbosity
• Phrase Detectives
• Analyzing crowdsourced data
CROWDSOURCING
THE PROBLEM
• Many AI problems require knowledge on a massive scale:
– Commonsense knowledge (700M facts?)
– Vision
• Previous common wisdom:
– Impossible to codify such knowledge by hand (witness CYC)
– Need to learn it all from scratch
• New common wisdom: “Given the advent of the World Wide Web, AI projects now have access to the minds of millions” (Singh 2002)
CROWDSOURCING
• Take a task traditionally performed by one of a few agents
• Outsource it to the crowd on the web
THE WISDOM OF CROWDS
• By rights using the ‘crowd’ should lead to poor quality
• But in fact it often turns out that the judgment of the crowd is as good or better than that of the experts
WIKIPEDIA
WIKIPEDIA AND CROWDSOURCING
• Wikipedia, although not an AI project, is perhaps the best illustration of the power of crowdsourcing – how putting together many minds may result in an output of often incredible quality
• It is also a great illustration of what works and what doesn't:
– E.g., editorial control must be exercised by the web collaborators themselves
OTHER WEB COLLABORATION PROJECTS
• The OPEN DIRECTORY PROJECT
• Crater mapping (results) – Kanefsky
• Citizen Science
• Cognition and Language Laboratory
• Web Experiments
• Galaxy Zoo – Oxford University
Galaxy Zoo
• Launched in July 2007
• 1M galaxies imaged
• 50M classifications in the first year, from 150,000 visitors
COLLECTIVE RESOURCE CREATION FOR AI: OPEN MIND COMMONSENSE
• Singh 2002:“Every ordinary person has common sense of the kind we want to give to our machines”
WEB COLLABORATION IN AI: OPEN MIND COMMONSENSE
OPEN MIND COMMONSENSE
• A project started in 1999 (Chklovski, 1999) to collect commonsense knowledge from NETIZENS
• Around 30 ACTIVITIES organized to collect knowledge:
– About taxonomies (CATS ARE MAMMALS)
– About the uses of objects (CHAIRS ARE FOR SITTING ON)
WHAT’S IN OPEN MIND COMMONSENSE: CAR
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
• IsA: (IsA "apple" "fruit")
• PartOf: (PartOf "CPU" "computer")
• PropertyOf: (PropertyOf "coffee" "wet")
• MadeOf: (MadeOf "bread" "flour")
• DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
• PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
• SubeventOf: (SubeventOf "play sport" "score goal")
• FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
• LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
• CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
• LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
• EffectOf: (EffectOf "view video" "entertainment")
• DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
• DesireOf: (DesireOf "person" "not be depressed")
• MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
• UsedFor: (UsedFor "fireplace" "burn wood")
• CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
• SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
• ThematicKLine: (ThematicKLine "wedding dress" "veil")
• ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
COLLECTING COMMONSENSE KNOWLEDGE
• Originally:
– Using TEMPLATES
– Asking people to write stories
• Now: just templates?
OPEN MIND COMMONSENSE: ADDING KNOWLEDGE
TEMPLATES FOR ADDING KNOWLEDGE
OPEN MIND COMMONSENSE: CHECKING KNOWLEDGE
FROM OPENMIND COMMONSENSE TO CONCEPT NET
• ConceptNet (Havasi et al, 2009) is a semantic network extracted from OpenMind Commonsense assertions using simple heuristics
CONCEPT NET
FROM OPENMIND COMMONSENSE FACTS TO CONCEPTNET
A lime is a very sour fruit
isa(lime,fruit)
property_of(lime,very_sour)
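The extraction heuristic above can be sketched in a few lines. This is a toy illustration of template-based extraction, not the actual OpenMind/ConceptNet code: the regular expression and the relation names follow the slide's single example and are assumptions.

```python
import re

# One illustrative "X is a (modifier) Y" template of the kind used to turn
# collected sentences into relational assertions.
PATTERN = re.compile(r"^an? (\w+) is an? (?:(.+) )?(\w+)\.?$", re.I)

def extract(sentence):
    """Turn a template-matching sentence into simple assertions."""
    m = PATTERN.match(sentence.strip())
    if not m:
        return []
    concept, modifier, category = m.groups()
    assertions = [("isa", concept.lower(), category.lower())]
    if modifier:  # "very sour" -> property_of(lime, very_sour)
        assertions.append(("property_of", concept.lower(),
                           modifier.lower().replace(" ", "_")))
    return assertions

print(extract("A lime is a very sour fruit"))
# [('isa', 'lime', 'fruit'), ('property_of', 'lime', 'very_sour')]
```

Real systems needed many such templates plus the "simple heuristics" mentioned above to normalise the extracted concepts.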
OTHER USES OF WEB COLLABORATION IN AI
• Learner / Learner2 / 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• Semantic Wikis: www.semwiki.org
CROWDSOURCING: INCENTIVES
What motivated thousands or millions of people to collaborate on the web?
• Shared intent– Wikipedia– Citizen Science
• Financial incentives– Microtask crowdsourcing
• Enjoyment– Games-with-a-purpose
MICROTASK CROWDSOURCING
THE FINANCIAL INCENTIVE: MECHANICAL TURK
• Wikipedia, OpenMind Commonsense, and the like all rely on the voluntary effort of web users
• Mechanical Turk was developed by Amazon to take advantage of the willingness of large numbers of web users to do some work for very little pay
THE MECHANICAL TURK
AMAZON MECHANICAL TURK
HITs
• On the Mechanical Turk site, a REQUESTER creates a HUMAN INTELLIGENCE TASK (HIT) and specifies how much they are willing to pay TURKERS to complete it
– Typically, the payment is on the order of 1 to 10 cents per task
A TYPICAL HIT
CREATING A HIT
• Design
• Publish
• Manage
RESOURCE CENTER
DESIGN
EXAMPLE: CATEGORIZATION
USING AMT IS CHEAP
… AND FAST
USING MICROTASK CROWDSOURCING FOR NLP
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Zaenen (2008): evaluating RTE agreement
• Snow et al. (2008): using AMT for a variety of NLP annotation purposes
• Callison-Burch (2009): using AMT to evaluate MT
EXAMPLE: DIALECT IDENTIFICATION
EXAMPLE: SPELLING CORRECTION
USING MICROTASK CROWDSOURCING FOR NLP
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Zaenen (2008): evaluating RTE agreement
• Snow et al. (2008): using AMT for a variety of NLP annotation purposes, to test its quality
• Callison-Burch (2009): using AMT to evaluate MT
SNOW ET AL, 2008
• Objective: assess the quality of AMT annotation using a variety of NLP annotation tasks
• Quality assessment: numeric (inter-annotator agreement) and qualitative (error analysis)
SNOW ET AL: THE TASKS
AFFECT RECOGNITION
SNOW ET AL: QUALITY
• Compare ITA of turkers with ITA of experts
ITA: EXPERTS
• 6 experts in total
• Expert ITA is calculated as the average of the Pearson correlations of each annotator with the average of the other 5 annotators
ITA: TURKERS
• Average over k annotations to create a single “proto-labeler”
• Plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single expert ITA.
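The two ITA recipes above can be made concrete. A minimal sketch, with tiny invented rating vectors standing in for the real affect-rating data, so all values and function names are illustrative:

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation coefficient between two score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def expert_ita(expert_scores):
    """Average, over experts, of each expert's Pearson correlation
    with the mean of the remaining experts."""
    rs = []
    for i, scores in enumerate(expert_scores):
        others = [s for j, s in enumerate(expert_scores) if j != i]
        avg_others = [mean(col) for col in zip(*others)]
        rs.append(pearson(scores, avg_others))
    return mean(rs)

def proto_labeler_ita(turker_scores, k, expert_avg):
    """Average the first k Turker annotations into one 'proto-labeler'
    and correlate it with the averaged expert labels."""
    proto = [mean(col) for col in zip(*turker_scores[:k])]
    return pearson(proto, expert_avg)
```

Plotting `proto_labeler_ita` for k = 1..10 against the single-expert ITA reproduces the kind of curve discussed in the lecture: the proto-labeler improves as more annotations are averaged.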
ITA ACROSS TASKS
SNOW ET AL: COST
CALLISON-BURCH: USING MECHANICAL TURK INSTEAD OF AUTOMATIC METRICS FOR MT
• It is generally thought that human evaluation of the quality of translations is too expensive
• So automatic metrics like BLEU are used instead
• Callison-Burch:
– Turkers produce judgments very similar to experts' and much better correlated than BLEU
– Mechanical Turk can be used for a variety of tasks, including creating reference translations and evaluation through reading comprehension
Evaluation at the Workshop on Statistical MT 2008
• 11 systems
• Their output on test sentences ranked by experts
Using Mechanical Turk
• Each Turker is presented with the original sentence and the outputs of 5 systems, with radio buttons to rank them
• Required 975 HITs
• Total cost: $9.75
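Each HIT thus yields one partial ranking of the systems. A naive way to aggregate such radio-button judgments (an illustration only, not Callison-Burch's actual analysis) is to average each system's rank across HITs:

```python
from collections import defaultdict
from statistics import mean

def aggregate_rankings(hits):
    """hits: list of dicts mapping system name -> rank (1 = best) from one
    Turker judgment. Returns systems sorted by mean rank, best first."""
    ranks = defaultdict(list)
    for hit in hits:
        for system, rank in hit.items():
            ranks[system].append(rank)
    return sorted(ranks, key=lambda s: mean(ranks[s]))

# Three toy judgments over three hypothetical systems
hits = [
    {"sysA": 1, "sysB": 2, "sysC": 3},
    {"sysA": 2, "sysB": 1, "sysC": 3},
    {"sysA": 1, "sysB": 3, "sysC": 2},
]
print(aggregate_rankings(hits))  # ['sysA', 'sysB', 'sysC']
```

With many redundant judgments per sentence, even this crude averaging washes out much of the individual Turkers' noise.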
AGREEMENT WITH EXPERTS
CORRELATION BETWEEN RANKINGS
CROWDSOURCING PRACTICE
• Export linguistic data as a CSV file and upload it to Amazon or CrowdFlower
• Create instructions as HTML
• Customise the annotation UI (e.g. may need JavaScript for markable selection)
• Select how many judgments per micro-task and any restrictions on the annotators (e.g. country of origin)
• Test it and revisit any of the above, as needed
• Launch it and collect the data
• Download the results and put together the corpus
• Adjudicate
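The first step above, exporting the data as CSV, can be as simple as the following sketch. The column names and example rows are illustrative; each platform documents its own expected schema.

```python
import csv

# Hypothetical annotation units: one row per markable to be judged.
items = [
    {"doc_id": "d1", "sentence": "Alice followed the Rabbit.",
     "markable": "the Rabbit"},
    {"doc_id": "d1", "sentence": "She was close behind it.",
     "markable": "it"},
]

# Write a CSV that the crowdsourcing platform can ingest as task units.
with open("units.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_id", "sentence", "markable"])
    writer.writeheader()
    writer.writerows(items)
```

The downloaded results come back in a similar tabular form, one row per judgment, ready for the adjudication step.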
WHAT YOU HAVE TO DO
Example: CrowdFlower Instructions
Example: Marking Locations in tweets
Example: Locations selected
Example 2: Entity Linking Annotation in Crowdflower
QUALITY CONTROL
• Have each task performed by multiple Turkers
• Turkers may have to meet qualifications, e.g. a minimum approval rate
– Each Turker has a 'reliability score', as on eBay
– P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes.
• Honey pots (trap questions with known answers)
• Defense against spammers
• May reject work / block a worker
– But: this has serious consequences
Dealing with bad workers
• Pay for “bad” work instead of rejecting it?
– Pro: preserves reputation; admits it when poor design is at fault
– Con: promotes fraud, undermines the approval-rating system
• Use bonuses as an incentive
– Pay the minimum $0.01 plus a $0.01 bonus
– Better than rejecting a $0.02 task
• Detect and block spammers
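The honey-pot and multiple-judgment defenses combine naturally: drop workers who fail the trap questions, then take a majority vote among those who remain. A sketch with invented data and an assumed 70% accuracy threshold:

```python
from collections import Counter

# Gold "honey pot" items with known answers (illustrative)
gold = {"trap1": "yes"}

# Each worker's answers, including the trap question
worker_answers = {
    "w1": {"trap1": "yes", "q1": "A"},
    "w2": {"trap1": "yes", "q1": "A"},
    "spammer": {"trap1": "no", "q1": "B"},
}

def filter_by_gold(answers_by_worker, gold, threshold=0.7):
    """Keep only workers whose accuracy on the gold items meets the
    (illustrative) threshold."""
    kept = {}
    for worker, answers in answers_by_worker.items():
        trapped = [q for q in gold if q in answers]
        if not trapped:
            continue  # worker saw no traps: no evidence either way
        acc = sum(answers[q] == gold[q] for q in trapped) / len(trapped)
        if acc >= threshold:
            kept[worker] = answers
    return kept

def majority_vote(answers_by_worker, question):
    """Most common answer to a question among the (filtered) workers."""
    votes = Counter(a[question] for a in answers_by_worker.values()
                    if question in a)
    return votes.most_common(1)[0][0]

trusted = filter_by_gold(worker_answers, gold)
print(sorted(trusted))               # ['w1', 'w2']
print(majority_vote(trusted, "q1"))  # A
```

The spammer fails the trap and is excluded before the vote, so their answers never reach the corpus.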
CROWDSOURCING WITH GATE
Crowdsourcing with GATE
• Download and use the GATE Crowdsourcing plugin
• https://gate.ac.uk/wiki/crowdsourcing.html
• Automatically transforms texts with GATE annotations into CrowdFlower jobs
• Generates the CF user interface (based on templates)
• The researcher then checks and runs the project in CF
• On completion, the plugin automatically imports the results back into GATE, aligning them to sentences and representing the multiple annotators
GATE Crowdsourcing Overview (1)
• Choose a job builder
– Classification
– Sequence Selection
• Configure the corresponding user interface and provide the task instructions
GATE Crowdsourcing Overview (2)
• Pre-process the corpus with TwitIE/ANNIE, e.g.
– Tokenisation
– POS tagging
– Sentence splitting
– NE recognition
• Automatically create the target annotations and any dynamic values required for classification
• Execute the job builder to automatically upload units to CF
Configure and execute the job in CF
Gold data units can also be uploaded from GATE, so CF controls quality
Automatic CF Import into GATE
• Each CF judgement is imported back as a separate annotation with some metadata
• Adjudication can happen automatically (e.g. majority vote and/or trust-based) or manually (Annotation Stack editor)
• The resulting corpus is ready to use for experiments or can be exported out of GATE as XML/XCES
CONCLUSIONS
• Microtask crowdsourcing is here to stay, especially for small to medium scale annotation
• A viable alternative to human evaluation by experts, and a much better alternative than ranking by BLEU
READINGS
• Push Singh (2002). The Public Acquisition of Commonsense Knowledge. In Proc. of the AAAI Spring Symposium on Acquiring Linguistic and World Knowledge for Information Access.
• Snow et al. (2008). Cheap and fast – but is it good? Proc. of EMNLP.
• Chris Callison-Burch (2009). Fast, cheap and creative: Evaluating translation quality using Amazon's Mechanical Turk. Proc. of EMNLP.
• Poesio, Chamberlain & Kruschwitz (forthcoming). Crowdsourcing. In N. Ide & J. Pustejovsky (eds), Handbook of Linguistic Annotation.
ACKNOWLEDGMENTS
• A number of slides borrowed from:
– Snow et al.'s presentation at EMNLP 2008
– Matt Lease, University of Texas at Austin
– Kalina Bontcheva and Leon Derczynski's EACL 2014 tutorial on NLP for Social Media