CROWDSOURCING (ANAPHORIC) ANNOTATION
Massimo Poesio, University of Essex
Part 1: Intro, Microtask crowdsourcing
ANNOTATED CORPORA: AN ERA OF PLENTY?
• With the release of the OntoNotes and ANC corpora for English, and of a number of corpora for other languages (the Prague Dependency Treebank especially), one might think no more annotation will be needed for a while
• But this is not the case:
– Size is still an issue
– The existing annotations have problems
– Novel tasks require new annotation
THE CASE OF ANAPHORIC ANNOTATION
• After many years of scarcity we now live in an era of abundance to study anaphora
• The release after 2008 of a number of anaphoric corpora annotated according to more linguistically inspired guidelines has enabled computational linguists to start working on a version of the problem that more closely resembles the way linguists look at it
– Also, one hears less of coreference and more of anaphora
• Recent competitions all use corpora of this type:– SEMEVAL 2010, CONLL 2011 / 2012, EVALITA 2011, …
BUT …
• The larger corpora:
– Not that large: still only 1M words at the most (see the problems of overfitting with the Penn Treebank)
– Cover only a few genres (mostly news)
– Annotation schemes are still pretty basic
ONTONOTES: OUT OF SCOPE
There was not a moment to be lost: away went Alice like the wind, and was just in time to hear it say, as it turned a corner, 'Oh my ears and whiskers, how late it's getting!' She was close behind it when she turned the corner, but the Rabbit was no longer to be seen: she found herself in a long, low hall, which was lit up by a row of lamps hanging from the roof.
There were doors all round the hall, but they were all locked; and when Alice had been all the way down one side and up the other, trying every door, she walked sadly down the middle, wondering how she was ever to get out again.
ONTONOTES: OUT OF SCOPE
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
ALSO …
• In many respects the doubts concerning the empirical approach to anaphora raised by Zaenen in her 2006 CL squib (Markup barking up the wrong tree) still apply
• “anaphora is not like syntax – we don’t understand the phenomenon quite as well”
• Even the simpler types of anaphoric annotation are still problematic– If not for linguists, for computational types
• And we only started grappling with the more complex types of anaphora
She could hear the rattle of the teacups as [[the March Hare] and [his friends]] shared their never-ending meal
PLURALS WITH COORDINATED ANTECEDENTS: SPLIT ANTECEDENTS (GNOME, ARRAU, SERENGETI, TüBa-D/Z?)
'In THAT direction,' the Cat said, waving its right paw round, 'lives [a Hatter]: and in THAT direction,' waving the other paw, 'lives [a March Hare]. Visit either you like: they're both mad.'
Alice had no idea what [Latitude] was, or [Longitude] either, but thought they were nice grand words to say.
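Examples like these have a practical consequence for annotation formats: an anaphor must be allowed to point to a set of antecedent markables, not just one. A minimal sketch of such a representation (the class and field names are illustrative, not any existing corpus's scheme):

```python
# Split antecedents: a plural pronoun ("they") can refer back to a SET of
# markables rather than a single one. All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Markable:
    id: str
    text: str

@dataclass
class AnaphoricLink:
    anaphor: Markable
    antecedents: frozenset  # a set, so split antecedents are representable

hatter = Markable("m1", "a Hatter")
hare = Markable("m2", "a March Hare")
they = Markable("m3", "they")

link = AnaphoricLink(anaphor=they, antecedents=frozenset({hatter, hare}))
print(len(link.antecedents))  # 2 — the plural pronoun has two antecedents
```

A single-antecedent scheme would have to drop or distort one of the two links; making the antecedent slot a set keeps both.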
15.12 M: we’re gonna take the engine E3
15.13 : and shove it over to Corning
15.14 : hook [it] up to [the tanker car]
15.15 : _and_
15.16 : send it back to Elmira
(from the TRAINS-91 dialogues collected at the University of Rochester)
AMBIGUITY: REFERENT
www.phrasedetectives.com
About 160 workers at a factory that made paper for the Kent filters were exposed to asbestos in the 1950s.
Areas of the factory were particularly dusty where the crocidolite was used.
Workers dumped large burlap sacks of the imported material into a huge bin, poured in cotton and acetate fibers and mechanically mixed the dry fibers in a process used to make filters.
Workers described "clouds of blue dust" that hung over parts of the factory,
even though exhaust fans ventilated the area.
AMBIGUITY: REFERENT
AMBIGUITY: EXPLETIVES
'I beg your pardon!' said the Mouse, frowning, but very politely: 'Did you speak?'
'Not I!' said the Lory hastily.
'I thought you did,' said the Mouse. '--I proceed. "Edwin and Morcar, the earls of Mercia and Northumbria, declared for him: and even Stigand, the patriotic archbishop of Canterbury, found it advisable--"'
'Found WHAT?' said the Duck.
'Found IT,' the Mouse replied rather crossly: 'of course you know what "it" means.'
AND ANYWAY …
• A great deal, if not most, research in (computational) linguistics requires annotation of new types:
– A new syntactic phenomenon
– Named entities in a new domain
– A type of anaphoric phenomenon not yet annotated (or annotated badly)
OUR CONTENTION
• CROWDSOURCING can be (part of) the solution to these problems:
– Microtask crowdsourcing for small-to-medium scale annotation projects
• Including the majority of annotation carried out by linguists / psycholinguists (see e.g. Munro et al 2010)
– Games-with-a-purpose for larger scale projects
– Either way, to gather more evidence about the linguistic phenomena
OUTLINE OF THE LECTURES
• Crowdsourcing in AI and NLP (today)
– What is crowdsourcing; applications to NLP
• Games with a purpose
– ESP, Verbosity
• Phrase Detectives
• Analyzing crowdsourced data
CROWDSOURCING
THE PROBLEM
• Many AI problems require knowledge on a massive scale:
– Commonsense knowledge (700M facts?)
– Vision
• Previous common wisdom:
– Impossible to codify such knowledge by hand (witness CYC)
– Need to learn it all from scratch
• New common wisdom: “Given the advent of the World Wide Web, AI projects now have access to the minds of millions” (Singh 2002)
CROWDSOURCING
• Take a task traditionally performed by one of a few agents
• Outsource it to the crowd on the web
THE WISDOM OF CROWDS
• By rights using the ‘crowd’ should lead to poor quality
• But in fact it often turns out that the judgment of the crowd is as good or better than that of the experts
WIKIPEDIA
WIKIPEDIA AND CROWDSOURCING
• Wikipedia, although not an AI project, is perhaps the best illustration of the power of crowdsourcing – how putting together many minds may result in an output of often incredible quality
• It is also a great illustration of what works and what doesn't:
– E.g., editorial control must be exercised by the web collaborators themselves
OTHER WEB COLLABORATION PROJECTS
• The OPEN DIRECTORY PROJECT
• Crater mapping (results) – Kanefsky
• Citizen Science
• Cognition and Language Laboratory
• Web Experiments
• Galaxy Zoo – Oxford University
Galaxy Zoo
• Launched in July 2007
• 1M galaxies imaged
• 50M classifications in the first year, from 150,000 visitors
COLLECTIVE RESOURCE CREATION FOR AI: OPEN MIND COMMONSENSE
• Singh 2002:“Every ordinary person has common sense of the kind we want to give to our machines”
WEB COLLABORATION IN AI: OPEN MIND COMMONSENSE
OPEN MIND COMMONSENSE
• A project started in 1999 (Chklovski, 1999) to collect commonsense knowledge from NETIZENS
• Around 30 ACTIVITIES organized to collect knowledge:
– About taxonomies (CATS ARE MAMMALS)
– About the uses of objects (CHAIRS ARE FOR SITTING ON)
WHAT’S IN OPEN MIND COMMONSENSE: CAR
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
• IsA: (IsA "apple" "fruit")
• PartOf: (PartOf "CPU" "computer")
• PropertyOf: (PropertyOf "coffee" "wet")
• MadeOf: (MadeOf "bread" "flour")
• DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
• PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
• SubeventOf: (SubeventOf "play sport" "score goal")
• FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
• LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
• CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
• LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
• EffectOf: (EffectOf "view video" "entertainment")
• DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
• DesireOf: (DesireOf "person" "not be depressed")
• MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
• UsedFor: (UsedFor "fireplace" "burn wood")
• CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
• SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
• ThematicKLine: (ThematicKLine "wedding dress" "veil")
• ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
COLLECTING COMMONSENSE KNOWLEDGE
• Originally:
– Using TEMPLATES
– Asking people to write stories
• Now: just templates?
OPEN MIND COMMONSENSE: ADDING KNOWLEDGE
TEMPLATES FOR ADDING KNOWLEDGE
OPEN MIND COMMONSENSE: CHECKING KNOWLEDGE
FROM OPENMIND COMMONSENSE TO CONCEPT NET
• ConceptNet (Havasi et al, 2009) is a semantic network extracted from OpenMind Commonsense assertions using simple heuristics
CONCEPT NET
FROM OPENMIND COMMONSENSE FACTS TO CONCEPTNET
A lime is a very sour fruit
isa(lime,fruit)
property_of(lime,very_sour)
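The extraction heuristic above can be sketched in a few lines. This is a toy illustration of template-based extraction, not the actual OpenMind/ConceptNet code: the regular expression and the relation names follow the slide's single example and are assumptions.

```python
import re

# One illustrative "X is a (modifier) Y" template of the kind used to turn
# collected sentences into relational assertions.
PATTERN = re.compile(r"^an? (\w+) is an? (?:(.+) )?(\w+)\.?$", re.I)

def extract(sentence):
    """Turn a template-matching sentence into simple assertions."""
    m = PATTERN.match(sentence.strip())
    if not m:
        return []
    concept, modifier, category = m.groups()
    assertions = [("isa", concept.lower(), category.lower())]
    if modifier:  # "very sour" -> property_of(lime, very_sour)
        assertions.append(("property_of", concept.lower(),
                           modifier.lower().replace(" ", "_")))
    return assertions

print(extract("A lime is a very sour fruit"))
# [('isa', 'lime', 'fruit'), ('property_of', 'lime', 'very_sour')]
```

Real systems needed many such templates plus the "simple heuristics" mentioned above to normalise the extracted concepts.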
OTHER USES OF WEB COLLABORATION IN AI
• Learner / Learner2 / 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• Semantic Wikis: www.semwiki.org
CROWDSOURCING: INCENTIVES
What motivated thousands or millions of people to collaborate on the web?
• Shared intent– Wikipedia– Citizen Science
• Financial incentives– Microtask crowdsourcing
• Enjoyment– Games-with-a-purpose
MICROTASK CROWDSOURCING
THE FINANCIAL INCENTIVE: MECHANICAL TURK
• Wikipedia, OpenMind Commonsense, and the like all rely on the voluntary effort of web users
• Mechanical Turk was developed by Amazon to take advantage of the willingness of large numbers of web users to do some work for very little pay
THE MECHANICAL TURK
AMAZON MECHANICAL TURK
HITs
• On the Mechanical Turk site, a REQUESTER creates a HUMAN INTELLIGENCE TASK (HIT) and specifies how much they are willing to pay TURKERS to complete it
– Typically, the payment is on the order of 1 to 10 cents per task
A TYPICAL HIT
CREATING A HIT
• Design
• Publish
• Manage
RESOURCE CENTER
DESIGN
EXAMPLE: CATEGORIZATION
USING AMT IS CHEAP
… AND FAST
USING MICROTASK CROWDSOURCING FOR NLP
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Zaenen (2008): evaluating RTE agreement
• Snow et al. (2008): using AMT for a variety of NLP annotation purposes
• Callison-Burch (2009): using AMT to evaluate MT
EXAMPLE: DIALECT IDENTIFICATION
EXAMPLE: SPELLING CORRECTION
USING MICROTASK CROWDSOURCING FOR NLP
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Zaenen (2008): evaluating RTE agreement
• Snow et al. (2008): using AMT for a variety of NLP annotation purposes, to test its quality
• Callison-Burch (2009): using AMT to evaluate MT
SNOW ET AL, 2008
• Objective: assess the quality of AMT annotation using a variety of NLP annotation tasks
• Quality assessment: numeric (inter-annotator agreement) and qualitative (error analysis)
SNOW ET AL: THE TASKS
AFFECT RECOGNITION
SNOW ET AL: QUALITY
• Compare ITA of turkers with ITA of experts
ITA: EXPERTS
• 6 experts in total
• Expert ITA is calculated as the average of the Pearson correlations of each annotator with the average of the other 5 annotators
ITA: TURKERS
• Average over k annotations to create a single “proto-labeler”
• Plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single expert ITA.
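The two ITA recipes above can be made concrete. A minimal sketch, with tiny invented rating vectors standing in for the real affect-rating data, so all values and function names are illustrative:

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation coefficient between two score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def expert_ita(expert_scores):
    """Average, over experts, of each expert's Pearson correlation
    with the mean of the remaining experts."""
    rs = []
    for i, scores in enumerate(expert_scores):
        others = [s for j, s in enumerate(expert_scores) if j != i]
        avg_others = [mean(col) for col in zip(*others)]
        rs.append(pearson(scores, avg_others))
    return mean(rs)

def proto_labeler_ita(turker_scores, k, expert_avg):
    """Average the first k Turker annotations into one 'proto-labeler'
    and correlate it with the averaged expert labels."""
    proto = [mean(col) for col in zip(*turker_scores[:k])]
    return pearson(proto, expert_avg)
```

Plotting `proto_labeler_ita` for k = 1..10 against the single-expert ITA reproduces the kind of curve discussed in the lecture: the proto-labeler improves as more annotations are averaged.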
ITA ACROSS TASKS
SNOW ET AL: COST
CALLISON-BURCH: USING MECHANICAL TURK INSTEAD OF AUTOMATIC METRICS FOR MT
• It is generally thought that human evaluation of the quality of translations is too expensive
• So automatic metrics like BLEU are used instead
• Callison-Burch:
– Turkers produce judgments very similar to experts' and much better correlated than BLEU
– Mechanical Turk can be used for a variety of tasks, including creating reference translations and evaluation through reading comprehension
Evaluation at the Workshop on Statistical MT 2008
• 11 systems
• Their output on test sentences ranked by experts
Using Mechanical Turk
• Each Turker is presented with the original sentence and the outputs of 5 systems, with radio buttons to rank them
• Required 975 HITs
• Total cost: $9.75
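Each HIT thus yields one partial ranking of the systems. A naive way to aggregate such radio-button judgments (an illustration only, not Callison-Burch's actual analysis) is to average each system's rank across HITs:

```python
from collections import defaultdict
from statistics import mean

def aggregate_rankings(hits):
    """hits: list of dicts mapping system name -> rank (1 = best) from one
    Turker judgment. Returns systems sorted by mean rank, best first."""
    ranks = defaultdict(list)
    for hit in hits:
        for system, rank in hit.items():
            ranks[system].append(rank)
    return sorted(ranks, key=lambda s: mean(ranks[s]))

# Three toy judgments over three hypothetical systems
hits = [
    {"sysA": 1, "sysB": 2, "sysC": 3},
    {"sysA": 2, "sysB": 1, "sysC": 3},
    {"sysA": 1, "sysB": 3, "sysC": 2},
]
print(aggregate_rankings(hits))  # ['sysA', 'sysB', 'sysC']
```

With many redundant judgments per sentence, even this crude averaging washes out much of the individual Turkers' noise.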
AGREEMENT WITH EXPERTS
CORRELATION BETWEEN RANKINGS
CROWDSOURCING PRACTICE
• Export linguistic data as a CSV file and upload it to Amazon or CrowdFlower
• Create instructions as HTML
• Customise the annotation UI (e.g. may need JavaScript for markable selection)
• Select how many judgments per micro-task and any restrictions on the annotators (e.g. country of origin)
• Test it and revisit any of the above, as needed
• Launch it and collect the data
• Download the results and put together the corpus
• Adjudicate
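The first step above, exporting the data as CSV, can be as simple as the following sketch. The column names and example rows are illustrative; each platform documents its own expected schema.

```python
import csv

# Hypothetical annotation units: one row per markable to be judged.
items = [
    {"doc_id": "d1", "sentence": "Alice followed the Rabbit.",
     "markable": "the Rabbit"},
    {"doc_id": "d1", "sentence": "She was close behind it.",
     "markable": "it"},
]

# Write a CSV that the crowdsourcing platform can ingest as task units.
with open("units.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_id", "sentence", "markable"])
    writer.writeheader()
    writer.writerows(items)
```

The downloaded results come back in a similar tabular form, one row per judgment, ready for the adjudication step.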
WHAT YOU HAVE TO DO
Example: CrowdFlower Instructions
Example: Marking Locations in tweets
Example: Locations selected
Example 2: Entity Linking Annotation in Crowdflower
QUALITY CONTROL
• Have each task performed by multiple Turkers
• Turkers may have to meet qualifications, e.g. a minimum approval rate
– Each Turker has a 'reliability score', as on eBay
– P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes.
• Honey pots (trap questions with known answers)
• Defense against spammers
• May reject work / block a worker
– But: this has serious consequences
Dealing with bad workers
• Pay for “bad” work instead of rejecting it?
– Pro: preserves reputation; admits it when poor design is at fault
– Con: promotes fraud, undermines the approval-rating system
• Use bonuses as an incentive
– Pay the minimum $0.01 plus a $0.01 bonus
– Better than rejecting a $0.02 task
• Detect and block spammers
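The honey-pot and multiple-judgment defenses combine naturally: drop workers who fail the trap questions, then take a majority vote among those who remain. A sketch with invented data and an assumed 70% accuracy threshold:

```python
from collections import Counter

# Gold "honey pot" items with known answers (illustrative)
gold = {"trap1": "yes"}

# Each worker's answers, including the trap question
worker_answers = {
    "w1": {"trap1": "yes", "q1": "A"},
    "w2": {"trap1": "yes", "q1": "A"},
    "spammer": {"trap1": "no", "q1": "B"},
}

def filter_by_gold(answers_by_worker, gold, threshold=0.7):
    """Keep only workers whose accuracy on the gold items meets the
    (illustrative) threshold."""
    kept = {}
    for worker, answers in answers_by_worker.items():
        trapped = [q for q in gold if q in answers]
        if not trapped:
            continue  # worker saw no traps: no evidence either way
        acc = sum(answers[q] == gold[q] for q in trapped) / len(trapped)
        if acc >= threshold:
            kept[worker] = answers
    return kept

def majority_vote(answers_by_worker, question):
    """Most common answer to a question among the (filtered) workers."""
    votes = Counter(a[question] for a in answers_by_worker.values()
                    if question in a)
    return votes.most_common(1)[0][0]

trusted = filter_by_gold(worker_answers, gold)
print(sorted(trusted))               # ['w1', 'w2']
print(majority_vote(trusted, "q1"))  # A
```

The spammer fails the trap and is excluded before the vote, so their answers never reach the corpus.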
CROWDSOURCING WITH GATE
Crowdsourcing with GATE
• Download and use the GATE Crowdsourcing plugin
• https://gate.ac.uk/wiki/crowdsourcing.html
• Automatically transforms texts with GATE annotations into CrowdFlower jobs
• Generates the CF user interface (based on templates)
• The researcher then checks and runs the project in CF
• On completion, the plugin automatically imports the results back into GATE, aligning them to sentences and representing the multiple annotators
GATE Crowdsourcing Overview (1)
• Choose a job builder
– Classification
– Sequence Selection
• Configure the corresponding user interface and provide the task instructions
GATE Crowdsourcing Overview (2)
• Pre-process the corpus with TwitIE/ANNIE, e.g.
– Tokenisation
– POS tagging
– Sentence splitting
– NE recognition
• Automatically create the target annotations and any dynamic values required for classification
• Execute the job builder to automatically upload units to CF
Configure and execute the job in CF
Gold data units can also be uploaded from GATE, so CF controls quality
Automatic CF Import into GATE
• Each CF judgement is imported back as a separate annotation with some metadata
• Adjudication can happen automatically (e.g. majority vote and/or trust-based) or manually (Annotation Stack editor)
• The resulting corpus is ready to use for experiments or can be exported out of GATE as XML/XCES
CONCLUSIONS
• Microtask crowdsourcing is here to stay, especially for small to medium scale annotation
• A viable alternative to human evaluation by experts, and a much better alternative than ranking by BLEU
READINGS
• Push Singh (2002). The Public Acquisition of Commonsense Knowledge. In Proc. of the AAAI Spring Symposium on Acquiring Linguistic and World Knowledge for Information Access.
• Snow et al. (2008). Cheap and fast – but is it good? Proc. of EMNLP.
• Chris Callison-Burch (2009). Fast, cheap and creative: Evaluating translation quality using Amazon's Mechanical Turk. Proc. of EMNLP.
• Poesio, Chamberlain & Kruschwitz (forthcoming). Crowdsourcing. In N. Ide & J. Pustejovsky (eds), Handbook of Linguistic Annotation.
ACKNOWLEDGMENTS
• A number of slides borrowed from:
– Snow et al.'s presentation at EMNLP 2008
– Matt Lease, University of Texas at Austin
– Kalina Bontcheva and Leon Derczynski's EACL 2014 tutorial on NLP for Social Media