example: build an “adult web site” classifier
Crowdsourcing using Mechanical Turk
Quality Management and Scalability
Panos Ipeirotis – New York University
Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels using multiple workers
Pick the correct label based on majority vote
Probability of correctness increases with number of workers
Probability of correctness increases with quality of workers
1 worker: 70% correct
11 workers: 93% correct
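The effect of redundancy can be checked with a short calculation. The sketch below is not from the talk: it assumes a binary task, independent workers who all share the same accuracy, and an odd number of workers so that ties cannot occur. The function name `majority_vote_accuracy` is made up for illustration. Under these assumptions, 11 workers at 70% accuracy come out at roughly 92%, close to the figure quoted above.

```python
from math import comb


def majority_vote_accuracy(n_workers: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent workers is correct.

    Assumptions of this sketch: binary task, odd n (no ties), and every
    worker is correct with the same probability p_correct.
    """
    if n_workers % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n_workers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_workers, k) * p_correct**k * (1 - p_correct) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 11):
        print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

The same function also shows the second point on the slide: raising `p_correct` increases the result for any fixed number of workers.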
Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
We can compute error rates for each worker
Error rates for ATAMRO447HWJQ
P[X → X]=0.847%  P[X → G]=99.153%
P[G → X]=0.053%  P[G → G]=99.947%
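One simple way to obtain such per-worker rates is to compare each worker's labels with the majority label of every item. The sketch below makes that assumption explicit: the majority vote stands in for the true label, which is a simplification and not necessarily what the talk's tool does. The function names and the `(worker_id, item_id, label)` input layout are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """votes: iterable of (worker_id, item_id, label). Returns item_id -> majority label."""
    per_item = defaultdict(Counter)
    for _worker, item, label in votes:
        per_item[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in per_item.items()}


def worker_error_rates(votes):
    """Returns worker_id -> {(consensus_label, assigned_label): rate}.

    For each worker, the rates that share the same consensus label sum to 1,
    mirroring the P[true -> assigned] notation on the slide.
    """
    votes = list(votes)
    consensus = majority_labels(votes)
    counts = defaultdict(Counter)
    for worker, item, label in votes:
        counts[worker][(consensus[item], label)] += 1
    rates = {}
    for worker, pair_counts in counts.items():
        row_totals = Counter()
        for (true_label, _assigned), n in pair_counts.items():
            row_totals[true_label] += n
        rates[worker] = {
            pair: n / row_totals[pair[0]] for pair, n in pair_counts.items()
        }
    return rates
```

A worker who answers G for everything ends up with P[X → G] close to 100%, which is the pattern shown for ATAMRO447HWJQ.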
Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 49.6%
P[X → X]=0.847%  P[X → G]=99.153%
P[G → X]=0.053%  P[G → G]=99.947%
Action: REJECT and BLOCK
Results:
– Over time you block all spammers
– Spammers learn to avoid your HITs
– You can decrease redundancy, as quality of workers is higher
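The 49.6% figure follows from averaging the two per-class error rates, (99.153% + 0.053%) / 2. A minimal sketch of that arithmetic and of a reject rule is shown below. The equal weighting per class, the `margin` parameter, and the function names are assumptions made for illustration, not details given in the talk.

```python
def average_error_rate(confusion):
    """confusion: true_label -> {assigned_label: probability}.

    Averages the per-class error (1 - P[c -> c]) with equal weight per class.
    """
    errors = [1.0 - row.get(true_label, 0.0) for true_label, row in confusion.items()]
    return sum(errors) / len(errors)


def should_reject(confusion, margin=0.05):
    """Flag a worker whose average error is within `margin` of random guessing.

    Random guessing among k classes has error 1 - 1/k (50% for two classes).
    The margin is a hypothetical tuning knob.
    """
    random_error = 1.0 - 1.0 / len(confusion)
    return average_error_rate(confusion) >= random_error - margin


# Rates quoted on the slide, written as fractions.
ATAMRO447HWJQ = {
    "X": {"X": 0.00847, "G": 0.99153},
    "G": {"X": 0.00053, "G": 0.99947},
}

if __name__ == "__main__":
    print(round(average_error_rate(ATAMRO447HWJQ), 3), should_reject(ATAMRO447HWJQ))
```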
Too much theory?
Demo and open source implementation available at: http://qmturk.appspot.com
Input:
– Labels from Mechanical Turk
– Some “gold” data (optional)
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
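To illustrate why the cost of incorrect labelings matters as an input, the sketch below picks the label with the lowest expected cost given a probability estimate for the true label. It is a generic decision rule written for this example, not the algorithm of the tool mentioned above. The function name, the posterior values, and the cost values are all hypothetical.

```python
def min_cost_label(posterior, cost):
    """Choose the label with the lowest expected misclassification cost.

    posterior: label -> probability that it is the true label.
    cost: (true_label, chosen_label) -> cost; pairs not listed cost 0.
    """
    def expected_cost(chosen):
        return sum(p * cost.get((true, chosen), 0.0) for true, p in posterior.items())

    return min(posterior, key=expected_cost)


if __name__ == "__main__":
    posterior = {"G": 0.7, "X": 0.3}           # hypothetical estimate for one site
    asymmetric = {("X", "G"): 10.0, ("G", "X"): 1.0}  # missing porn is costlier
    print(min_cost_label(posterior, asymmetric))
```

With these numbers the site is labeled X even though G is more probable, because calling a porn site G is ten times as costly as the opposite mistake.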
How to handle free-form answers?
Q: “My task does not have discrete answers…”
A: Break into two HITs:
– “Create” HIT
– “Vote” HIT
Vote HIT controls quality of Creation HIT
Redundancy controls quality of Voting HIT
Catch: If the “creation” is very good, voting workers just vote “yes”
– Solution: Add some random noise (e.g. misspell)
[Diagram: Creation HIT (e.g. transcribe caption) → Voting HIT: Correct or not?]
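The “add some random noise” idea can be sketched as follows: a fraction of the created answers is deliberately corrupted before it is shown to voters, and voters who approve every corrupted item are flagged. Everything here is an illustrative assumption, including the function names, the adjacent-character swap used as the “misspelling”, and the `noise_fraction` default.

```python
import random


def misspell(text, rng):
    """Swap two adjacent characters to create a deliberately wrong variant."""
    if len(text) < 2:
        return text + "x"
    i = rng.randrange(len(text) - 1)
    swapped = text[:i] + text[i + 1] + text[i] + text[i + 2:]
    # Swapping two identical characters changes nothing, so force a change.
    return swapped if swapped != text else text + "x"


def build_vote_batch(creations, noise_fraction=0.2, seed=0):
    """Return a list of (text_shown, is_planted_error) pairs for the Vote HIT."""
    rng = random.Random(seed)
    batch = []
    for text in creations:
        if rng.random() < noise_fraction:
            batch.append((misspell(text, rng), True))
        else:
            batch.append((text, False))
    return batch


def always_yes_voters(votes, batch):
    """votes: worker_id -> list of 'yes'/'no' answers aligned with batch.

    Returns the workers who approved every planted error they were shown.
    """
    planted = [i for i, (_text, is_error) in enumerate(batch) if is_error]
    return sorted(
        worker
        for worker, answers in votes.items()
        if planted and all(answers[i] == "yes" for i in planted)
    )
```

Planted errors give the Vote HIT items with a known answer of “no”, so a worker who always clicks “yes” becomes visible even when the real creations are nearly all correct.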
Example: Collect URLs
But my free-form is not just right or wrong…
“Create” HIT, “Improve” HIT, “Compare” HIT
[Diagram: Creation HIT (e.g. describe the image) → Improve HIT (e.g. improve description) → Compare HIT (voting): Which is better?]
TurkIt toolkit: http://groups.csail.mit.edu/uid/turkit/
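The Create, Improve, Compare cycle can be written as a small loop. The sketch below is written in Python for illustration and does not use the TurkIt API. The three callables stand for posting the corresponding HIT and collecting its result, and their names and signatures are hypothetical.

```python
from typing import Callable, List


def iterate_improve(
    task: str,
    create: Callable[[str], str],
    improve: Callable[[str, str], str],
    compare: Callable[[str, str, str], str],
    rounds: int = 3,
) -> List[str]:
    """Run Create once, then alternate Improve and Compare.

    `compare(task, current, candidate)` must return whichever of the two
    texts the voters preferred. Returns the accepted versions, oldest first.
    """
    best = create(task)
    history = [best]
    for _ in range(rounds):
        candidate = improve(task, best)
        winner = compare(task, best, candidate)
        # Keep the candidate only if voters preferred it over the current best.
        if winner == candidate and candidate != best:
            best = candidate
            history.append(best)
    return history
```

In a real deployment each callable would block until workers have answered, and `compare` would aggregate several redundant votes.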
Describe this
version 1: A parial view of a pocket calculator together with some coins and a pen.
version 2: A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items.
version 3: A close-up photograph of the following items: A CASIO multi-function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
version 4: …Various British coins; two of £1 value, three of 20p value and one of 1p value. …
version 8: “A close-up photograph of the following items: A CASIO multi-function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance - probably personal finance.”
Future: Break big tasks into simple ones and build a workflow
Running experiment: Crowdsource big tasks (e.g., tourist guide)
My Boss is a Robot (mybossisarobot.com)
Nikki Kittur (Carnegie Mellon) + Jim Giles (New Scientist)
– Identify sights worth checking out (one tip per worker)
  • Vote and rank
– Brief tips for each monument (one tip per worker)
  • Vote and rank
– Aggregate tips in meaningful summary
  • Iterate to improve…
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
Correcting biases
Classifying sites as G, PG, R, X
Sometimes workers are careful but biased
Classifies G → P and P → R
Average error rate: too high
Is she a spammer?
Error Rates for CEO of company detecting offensive content (and parent)
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%   P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%   P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%   P[X → R]=0.0%    P[X → X]=100.0%
Correcting biases
For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error and bias
True error-rate ~ 9%
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%   P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%   P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%   P[X → R]=0.0%    P[X → X]=100.0%
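The slides omit the technical details of “reversing the errors”. One plausible reading, offered here only as a sketch, is a Bayes inversion: given the label a worker assigned and that worker's error matrix, compute how likely each true label is. The uniform class priors and the function name are assumptions of this example, and the sketch does not reproduce the ~9% figure, which depends on data not shown in the slides.

```python
LABELS = ["G", "P", "R", "X"]

# Error matrix from the slide: CONFUSION[true][assigned] = P[true -> assigned].
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

# Assumed priors over the true class; the talk does not give these.
UNIFORM = {label: 0.25 for label in LABELS}


def posterior_true_label(assigned, confusion, priors):
    """P(true | assigned), proportional to priors[true] * confusion[true][assigned]."""
    weights = {t: priors[t] * confusion[t][assigned] for t in confusion}
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"label {assigned!r} is impossible under this matrix")
    return {t: w / total for t, w in weights.items()}


if __name__ == "__main__":
    for assigned in LABELS:
        print(assigned, posterior_true_label(assigned, CONFUSION, UNIFORM))
```

Under these assumptions a “P” from this worker always means the site is really G, so the label is still informative despite the high raw error rate. That is what separates a biased worker from a spammer, whose labels carry almost no information about the true class.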
Scaling Crowdsourcing: Use Machine Learning
Human labor is expensive, even when paying cents
Need to scale crowdsourcing
Basic idea: Build a machine learning model and use it instead of humans
[Diagram: Data from existing crowdsourced answers → Automatic Model (through machine learning); New Case → Automatic Model → Automatic Answer]
Tradeoffs for Automatic Models: Effect of Noise
Get more data → Improve model accuracy
Improve data quality → Improve classification
Example Case: Porn or not?
[Plot: Accuracy (40 to 100) vs. Number of examples (Mushroom) (1 to 300), one curve each for Data Quality = 50%, 60%, 80%, 100%]
Scaling Crowdsourcing: Iterative training
Use machine when confident, humans otherwise
Retrain with new human input → improve model → reduce need for humans
[Diagram: New Case → Automatic Model (through machine learning), built on data from existing crowdsourced answers. Confident → Automatic Answer. Not confident → Get human(s) to answer.]
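The “machine when confident, humans otherwise” routing can be sketched in a few lines. The callables `predict_proba`, `ask_human`, and `retrain`, as well as the 0.9 threshold, are placeholders invented for this example. Any classifier that returns class probabilities could be plugged in.

```python
def route(case, predict_proba, ask_human, threshold=0.9):
    """Return (label, source) for one case.

    predict_proba(case) -> dict mapping label to probability.
    The model's answer is used only if its top probability reaches threshold.
    """
    probs = predict_proba(case)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "machine"
    return ask_human(case), "human"


def process(cases, predict_proba, ask_human, retrain, threshold=0.9):
    """Label all cases, then hand the new human answers to retrain()."""
    results, new_training = [], []
    for case in cases:
        label, source = route(case, predict_proba, ask_human, threshold)
        results.append((case, label, source))
        if source == "human":
            new_training.append((case, label))
    if new_training:
        retrain(new_training)  # new human input improves the next model
    return results
```

As the model improves with each retraining round, fewer cases fall below the threshold and fewer human answers are needed.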
Scaling Crowdsourcing: Iterative training, with noise
Use machine when confident, humans otherwise
Ask as many humans as necessary to ensure quality
[Diagram: New Case → Automatic Model (through machine learning), built on data from existing crowdsourced answers. Confident → Automatic Answer. Not confident → Get human(s) to answer, with a “Confident for quality?” check on the human answers.]
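“Ask as many humans as necessary” can be read as a stopping rule: keep requesting votes until the answer is confident enough. The sketch below assumes a yes/no question, independent workers who share one known accuracy, and a Bayesian update after each vote. The function name and all default values are hypothetical.

```python
def label_until_confident(get_next_vote, worker_accuracy=0.7,
                          target=0.95, max_votes=15, prior=0.5):
    """Collect yes/no votes until the posterior for one answer reaches target.

    get_next_vote() -> True or False from a fresh worker.
    Assumes independent workers, each correct with probability worker_accuracy.
    Returns (answer, posterior_of_answer, votes_used).
    """
    p_true = prior
    used = 0
    while used < max_votes and max(p_true, 1 - p_true) < target:
        vote = get_next_vote()
        used += 1
        # Likelihood of this vote if the true answer is True / False.
        like_true = worker_accuracy if vote else 1 - worker_accuracy
        like_false = 1 - worker_accuracy if vote else worker_accuracy
        p_true = p_true * like_true / (p_true * like_true + (1 - p_true) * like_false)
    answer = p_true >= 0.5
    return answer, max(p_true, 1 - p_true), used
```

With 70%-accurate workers who all agree, four votes are needed to pass 95% under these assumptions. With 90%-accurate workers two are enough, which matches the earlier point that higher worker quality lets you decrease redundancy.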