Efficient Named Entity Annotation through Pre-empting
Leon Derczynski
Kalina Bontcheva
Crowdsourcing in science is not new
Citizen science, from the early 20th century; 60,000-80,000 volunteers yearly
Sir Francis Galton, VOX POPULI
Francis Galton
In fact, as early as 1907, Sir Francis Galton (Darwin's cousin and a brilliant Victorian scientist) published a Nature article entitled Vox Populi (the voice of the people, or the voice of the crowd), in which he describes his experiment at a livestock fair: 787 people were asked to estimate the weight of an ox and, while none came close to the real value, the mean of the guesses was almost spot-on.
Meanwhile, other societies were using the crowd differently, namely to help them gather scientific data. Since the early 20th century, the Audubon Society has relied on volunteers to count local bird species. Its campaigns continue to this day: in 2012, volunteers submitted over 100,000 checklists, yielding observations of 623 species and over 17 million individual birds. Such activities are often termed citizen science.
This is not a novel phenomenon
Citizen science projects have been around since the beginning of the last century (at least)
There is a vast and varied landscape of citizen science projects in which scientists call on the public for help; some examples, including from Lora's paper (her talk might mention some as well)
IT enables virtual citizen science projects, and this upsurge is a direct consequence of new and improved ways to involve the public in scientific processes
Crowdsourcing as an effective paradigm
Researchers enjoy annotating
which makes it expensive
Many documents are inefficient to annotate
What is Crowdsourcing?
Crowdsourcing is an emerging collaborative approach for acquiring annotated corpora and a wide range of other linguistic resources
Three main kinds of crowdsourcing platforms
paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF)
games with a purpose
volunteer-based platforms such as crowdcrafting
NLP researchers are increasingly using crowdsourcing as a novel, collaborative approach for obtaining linguistically annotated corpora
Example: CF Instructions
Example: CF Marking Locations in tweets
Example: CF Locations selected
Example 2: Entity Linking Annotation in CF
How to do it: The Easy Way
Download and use the GATE Crowdsourcing plugin
https://gate.ac.uk/wiki/crowdsourcing.html
Automatically transforms texts with GATE annotations into CF jobs
Generates the CF User Interface (based on templates)
Researcher then checks and runs the project in CF
On completion, the plugin automatically imports the results back into GATE, aligning sentences and representing the multiple annotators
GATE Crowdsourcing Overview (1)
Choose a job builder
Classification
Sequence Selection
Configure the corresponding user interface and provide the task instructions
GATE Crowdsourcing Overview (2)
Pre-process the corpus with TwitIE/ANNIE, e.g.
Tokenisation
POS tagging
Sentence splitting
NE recognition
Automatically create the target annotations and any dynamic values required for classification
Execute the job builder to upload units to CF automatically
Configure and execute the job in CF
Gold data units can also be uploaded from GATE, so CF controls quality
Automatic CF Import into GATE
Each CF judgement is imported back as a separate annotation with some metadata
Adjudication can happen automatically (e.g. majority vote and/or trust-based) or manually (Annotation Stack editor)
The resulting corpus is ready to use for experiments or can be exported out of GATE as XML/XCES
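The automatic adjudication step can be illustrated with a minimal sketch. This is not the GATE plugin's code: the function names and the (label, trust) input format are assumptions made for illustration, and ties are simply left unresolved so that they can go to manual adjudication.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label, or None for no labels or a tie."""
    ranked = Counter(labels).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave for manual adjudication
    return ranked[0][0]


def trust_weighted_vote(judgements):
    """judgements: list of (label, trust) pairs (hypothetical format).

    The label with the highest summed trust wins.
    """
    totals = defaultdict(float)
    for label, trust in judgements:
        totals[label] += trust
    if not totals:
        return None
    return max(totals, key=totals.get)
```

For example, three judgements of Location, Location and Organization resolve to Location under majority vote, while a single highly trusted annotator can outweigh two less trusted ones under the trust-weighted variant.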
Side effect
Medium-size corpus, and a...
How can this cost be reduced?
Introduce determinism
Hypothesis: do entity-bearing sentences improve NER performance?
Features: Character n-grams
Word shape n-grams
Token n-grams
Pretty good!
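The three feature families could be extracted per sentence roughly as follows. This is a sketch under assumptions: the slides do not give the n-gram order, the word-shape scheme, or the feature naming, so bigrams, an X/x/d shape mapping, and the "tok=", "shape=", "char=" prefixes are all illustrative choices.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape, e.g. 'Sheffield' -> 'Xxxxxxxxx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def ngrams(items, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def sentence_features(tokens, n=2):
    """Binary features for one tokenised sentence (n is an assumed order)."""
    feats = set()
    for gram in ngrams(tokens, n):
        feats.add("tok=" + " ".join(gram))
    for gram in ngrams([word_shape(t) for t in tokens], n):
        feats.add("shape=" + " ".join(gram))
    for tok in tokens:
        for gram in ngrams(list(tok), n):
            feats.add("char=" + "".join(gram))
    return feats
```

The resulting feature sets can be fed to any sentence-level classifier that accepts sparse binary features.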
Can we predict entities?
Baselines: 1. Random
2. All proper nouns = entities
Classifiers: Maxent
SVM
Cost-weighted SVM
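The proper-noun baseline can be sketched as below, read here as a sentence-level decision: a sentence is predicted to be entity-bearing if it contains any proper noun. That reading, the Penn Treebank tags NNP/NNPS, and the (token, tag) input format are assumptions; the slides do not spell them out.

```python
def proper_noun_baseline(tagged_sentence):
    """Predict 'entity-bearing' if any token carries a proper-noun tag."""
    return any(tag in ("NNP", "NNPS") for _, tag in tagged_sentence)


def precision_recall(predicted, gold):
    """Precision and recall of boolean predictions against boolean gold labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A baseline like this tends to over-predict, since capitalised tokens such as day names are tagged as proper nouns without being entities of interest.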
Validating the results: again
Saving money through ML?
A bit too good to be true... or is it?
Compare hand-labelled pre-empted to hand-labelled random
Cross-lingual investigation
English is a bit boring
How about something else?
Germanic; non-Germanic; morphologically rich
Entity prediction universally great!
This looks good! But how about extrinsic results?
Does this help NER?
Building a corpus
Let's try building a corpus
Social media: high variation
Insufficient diversity in NLP researchers (KKTNY in 45min...)
Does our hypothesis apply in this text type?
Can we pre-empt in tweets as well?
Let's try and get greedy: can we do this per-type?
Entity classifications tend to be arbitrary
Which features are useful?
Feature ablation
Highest-weighted features
GWAP Use Case
Language Quiz
www.twitter.com/uCompEU
Master Thesis Tutorial; Karl Wber
Language Quiz
Provide an open API to enable partners to send new tasks to the game (or CrowdFlower)
The game supports various task types (at launch: multiple choice questions and sentiment detection)
Players receive points through correct answers in the game
The correct answers will be determined by majority vote, after enough answers have been collected
Each month the high scores are reset and a monthly winner is determined
Players are able to invite their friends, compete against them and receive bonus points through their activity
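The deferred scoring rule above (points only once a majority answer exists) might look like this in outline. The threshold of five answers, the ten points per correct answer, and the player-to-answer dictionary are hypothetical; the actual game's values and API are not given in the slides.

```python
from collections import Counter

MIN_ANSWERS = 5  # hypothetical threshold for "enough answers"


def resolve_task(answers, min_answers=MIN_ANSWERS):
    """answers: dict mapping player -> answer.

    Return the majority answer, or None while there are too few
    answers or the top two answers are tied.
    """
    if len(answers) < min_answers:
        return None
    ranked = Counter(answers.values()).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def award_points(answers, points=10, min_answers=MIN_ANSWERS):
    """Give `points` to every player who matched the majority answer."""
    correct = resolve_task(answers, min_answers)
    if correct is None:
        return {}
    return {p: points for p, a in answers.items() if a == correct}
```

Until a task resolves, no player is scored on it, which matches the idea that correctness is only known after enough answers have been collected.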
Language Quiz
Thank you for your time!
Leon Derczynski
Kalina Bontcheva
This was part of the uComp project (www.ucomp.eu). uComp receives the funding support of EPSRC EP/K017896/1, FWF 1097-N23, and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET.
University of Sheffield, NLP
www.ucomp.eu | www.chistera.eu @uCompEU
08/09/15