![Page 1: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/1.jpg)
Crowdsourcing using Mechanical Turk
Quality Management and Scalability
Panos Ipeirotis – New York University
![Page 2: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/2.jpg)
Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
![Page 3: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/3.jpg)
Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr
![Page 4: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/4.jpg)
Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)
![Page 5: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/5.jpg)
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels using multiple workers
Pick the correct label based on majority vote
Probability of correctness increases with number of workers
Probability of correctness increases with quality of workers
– 1 worker: 70% correct
– 11 workers: 93% correct
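The effect of redundancy can be checked with a short calculation. This is a minimal sketch, assuming a binary task, workers who err independently, and the same accuracy for every worker; the function name `majority_vote_accuracy` is made up for illustration. Under these assumptions, 11 workers at 70% accuracy come out at roughly 92%, close to the figure quoted above.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that the majority of n workers is right.

    Assumes a binary task, independent workers, and that each worker
    is correct with the same probability p. n must be odd (no ties).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1)
    )


for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

Raising `p` in the call shows the second effect on the slide: better workers need less redundancy to reach the same accuracy.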
![Page 6: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/6.jpg)
Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
We can compute error rates for each worker
Error rates for ATAMRO447HWJQ:
P[X → X]=0.847% P[X → G]=99.153%
P[G → X]=0.053% P[G → G]=99.947%
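One way to get per-worker error rates from redundant labels is sketched below. The slides do not say how the rates are estimated, so this is a simplification: it treats the majority label of each site as the true label and counts how each worker deviates from it. The toy data, the worker names `w1`, `w2`, `spam`, and both function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def majority_labels(labels):
    """labels: list of (worker, item, label). Returns item -> majority label.

    Ties are broken by whichever label was seen first for that item.
    """
    votes = defaultdict(Counter)
    for _worker, item, label in labels:
        votes[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}


def worker_error_rates(labels):
    """Returns worker -> {(true, assigned): rate}.

    The majority label stands in for the unknown true label.
    """
    truth = majority_labels(labels)
    counts = defaultdict(Counter)
    for worker, item, label in labels:
        counts[worker][(truth[item], label)] += 1
    rates = {}
    for worker, c in counts.items():
        totals = Counter()
        for (true, _assigned), n in c.items():
            totals[true] += n
        rates[worker] = {k: n / totals[k[0]] for k, n in c.items()}
    return rates


# Toy data: two careful workers and one who always answers G.
labels = []
for item, true in [("a", "X"), ("b", "X"), ("c", "G"), ("d", "G")]:
    labels += [("w1", item, true), ("w2", item, true), ("spam", item, "G")]

print(worker_error_rates(labels)["spam"])
```

On the toy data the always-G worker gets P[X → G] = 1 and P[G → G] = 1, the same pattern as the error rates listed above.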
![Page 7: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/7.jpg)
Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 49.6%
P[X → X]=0.847% P[X → G]=99.153%
P[G → X]=0.053% P[G → G]=99.947%
Action: REJECT and BLOCK
Results:
– Over time you block all spammers
– Spammers learn to avoid your HITs
– You can decrease redundancy, as quality of workers is higher
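The 49.6% figure follows from the error rates if each true class is weighted equally: (99.153% + 0.053%) / 2. The sketch below reproduces that arithmetic. The rejection rule `should_reject` and its `margin` parameter are hypothetical; the slides only say that an error rate near the random baseline leads to REJECT and BLOCK.

```python
def average_error_rate(confusion):
    """confusion[true][assigned] = probability.

    Averages the misclassification probability over the true classes,
    giving each class equal weight.
    """
    per_class = [1.0 - row[true] for true, row in confusion.items()]
    return sum(per_class) / len(per_class)


def should_reject(rate, random_rate=0.5, margin=0.05):
    """Hypothetical rule: reject workers no better than near-random."""
    return rate >= random_rate - margin


# Error rates for ATAMRO447HWJQ, as fractions.
spammer = {
    "X": {"X": 0.00847, "G": 0.99153},
    "G": {"X": 0.00053, "G": 0.99947},
}

rate = average_error_rate(spammer)
print(f"{rate:.1%}", should_reject(rate))
```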
![Page 8: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/8.jpg)
Too much theory?
Demo and open source implementation available at: http://qmturk.appspot.com
Input:
– Labels from Mechanical Turk
– Some “gold” data (optional)
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
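The cost input can be read as a cost matrix in which labeling an X site as G is more expensive than the reverse. The sketch below shows how such costs can change the chosen label: pick the label with the lowest expected cost under a probability estimate for the true label. It is an illustration only, not the tool's algorithm; the probabilities, the cost values, and the name `min_cost_label` are made up.

```python
def min_cost_label(posterior, cost):
    """Pick the label with the lowest expected cost.

    posterior[true] = probability that `true` is the real label.
    cost[true][chosen] = cost of answering `chosen` when truth is `true`.
    """
    labels = list(posterior)

    def expected_cost(chosen):
        return sum(posterior[t] * cost[t][chosen] for t in labels)

    return min(labels, key=expected_cost)


posterior = {"G": 0.7, "X": 0.3}  # assumed estimate for one site

# Missing a porn site (X labeled G) costs 5, the reverse costs 1.
asymmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 5, "X": 0}}
symmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 1, "X": 0}}

print(min_cost_label(posterior, asymmetric))
print(min_cost_label(posterior, symmetric))
```

With the asymmetric costs the expected cost of answering G is 0.3 × 5 = 1.5 against 0.7 × 1 = 0.7 for X, so X is chosen even though G is more probable.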
![Page 9: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/9.jpg)
![Page 10: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/10.jpg)
![Page 11: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/11.jpg)
How to handle free-form answers?
Q: “My task does not have discrete answers….”
A: Break into two HITs:
– “Create” HIT
– “Vote” HIT
Vote HIT controls quality of Creation HIT
Redundancy controls quality of Voting HIT
Catch: If “creation” is very good, workers in voting just vote “yes”
– Solution: Add some random noise (e.g. misspell)
Diagram boxes:
– Creation HIT (e.g. transcribe caption)
– Voting HIT: Correct or not?
Example: Collect URLs
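The "random noise" idea can be sketched as follows: corrupt a fraction of the created answers on purpose, send them to the Voting HIT alongside the real ones, and check how often a voter says "yes" to a corrupted answer. This is a minimal sketch under stated assumptions; `misspell`, `build_vote_items`, `trap_pass_rate`, and the `trap_fraction` parameter are hypothetical names, and adjacent-character swapping is just one possible kind of noise.

```python
import random


def misspell(text, rng):
    """Swap two adjacent, different characters to corrupt an answer."""
    positions = [i for i in range(len(text) - 1) if text[i] != text[i + 1]]
    if not positions:
        return text  # nothing to swap (empty, single char, or all equal)
    i = rng.choice(positions)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def build_vote_items(answers, trap_fraction, rng):
    """Return (text, is_trap) pairs for the Voting HIT.

    Traps are corrupted copies that a careful voter should reject.
    """
    items = []
    for text in answers:
        if rng.random() < trap_fraction:
            items.append((misspell(text, rng), True))
        else:
            items.append((text, False))
    return items


def trap_pass_rate(items, votes):
    """Fraction of traps a voter marked as correct (votes are booleans)."""
    trap_votes = [v for (_text, is_trap), v in zip(items, votes) if is_trap]
    if not trap_votes:
        return 0.0
    return sum(trap_votes) / len(trap_votes)


rng = random.Random(0)
items = build_vote_items(["transcribed caption"] * 10, 0.3, rng)
always_yes = [True] * len(items)
print(trap_pass_rate(items, always_yes))
```

A voter who answers "yes" to everything passes every trap, which is the signal that separates lazy voters from careful ones when most created answers are good.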
![Page 12: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/12.jpg)
But my free-form is not just right or wrong…
“Create” HIT, “Improve” HIT, “Compare” HIT
– Creation HIT (e.g. describe the image)
– Improve HIT (e.g. improve description)
– Compare HIT (voting): Which is better?
TurKit toolkit: http://groups.csail.mit.edu/uid/turkit/
Describe this
![Page 13: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/13.jpg)
version 1:
A parial view of a pocket calculator together with some coins and a pen.
version 2:
A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items.
version 3:
A close-up photograph of the following items: A CASIO multi-function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
version 4:
…Various British coins; two of £1 value, three of 20p value and one of 1p value. …
version 8:
“A close-up photograph of the following items: A CASIO multi-function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance - probably personal finance.”
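The Create, Improve, Compare cycle that produces these versions can be sketched as a loop. This is not TurKit's API; it is a plain Python outline in which the three callables stand in for HITs posted to workers, and the stand-in functions at the bottom are toy assumptions so the example runs on its own.

```python
from typing import Callable


def iterate_improve(
    create: Callable[[], str],
    improve: Callable[[str], str],
    compare: Callable[[str, str], bool],
    rounds: int,
) -> str:
    """Run one Create HIT, then `rounds` Improve/Compare cycles.

    compare(candidate, best) returns True when voters prefer the
    candidate; only then does it replace the current best version.
    """
    best = create()
    for _ in range(rounds):
        candidate = improve(best)
        if compare(candidate, best):
            best = candidate
    return best


# Toy stand-ins for the three HIT types (assumptions, not real workers).
def create_stub() -> str:
    return "A calculator."


def improve_stub(text: str) -> str:
    return text + " More detail."


def compare_stub(candidate: str, best: str) -> bool:
    return len(candidate) > len(best)  # pretend voters prefer more detail


print(iterate_improve(create_stub, improve_stub, compare_stub, rounds=3))
```

Keeping the previous version whenever voters do not prefer the candidate means a bad "improvement" cannot make the description worse.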
![Page 14: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/14.jpg)
Future: Break big tasks into simple ones and build a workflow
Running experiment: Crowdsource big tasks (e.g., tourist guide)
My Boss is a Robot (mybossisarobot.com)
Niki Kittur (Carnegie Mellon) + Jim Giles (New Scientist)
– Identify sights worth checking out (one tip per worker)
  • Vote and rank
– Brief tips for each monument (one tip per worker)
  • Vote and rank
– Aggregate tips in meaningful summary
  • Iterate to improve…
![Page 15: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/15.jpg)
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
![Page 16: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/16.jpg)
Correcting biases
Classifying sites as G, PG, R, X
Sometimes workers are careful but biased
Classifies G → P and P → R
Average error rate: too high
Is she a spammer?
Error Rates for CEO of company detecting offensive content (and parent)
P[G → G]=20.0% P[G → P]=80.0% P[G → R]=0.0% P[G → X]=0.0%
P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P → X]=0.0%
P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R → X]=0.0%
P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
![Page 17: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/17.jpg)
Correcting biases
For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error and bias
True error-rate ~ 9%
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0% P[G → P]=80.0% P[G → R]=0.0% P[G → X]=0.0%
P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P → X]=0.0%
P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R → X]=0.0%
P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
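One possible reading of "reverse the errors" is to apply Bayes' rule to the worker's error rates: given the label the worker assigned, compute how likely each true label is. The slides omit the technical details, so the sketch below is an assumption rather than the method actually used, it assumes a uniform prior over the four classes, and it does not attempt to reproduce the ~9% figure.

```python
LABELS = ["G", "P", "R", "X"]

# confusion[true][assigned], taken from the error rates for ATLJIK76YH1TF.
confusion = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

# Assumption: every class is equally likely before seeing the label.
prior = {label: 0.25 for label in LABELS}


def posterior_true_label(assigned, confusion, prior):
    """P(true label | label the worker assigned), by Bayes' rule."""
    joint = {t: prior[t] * confusion[t][assigned] for t in confusion}
    total = sum(joint.values())
    if total == 0:
        raise ValueError(f"worker never assigns {assigned!r}")
    return {t: v / total for t, v in joint.items()}


for assigned in LABELS:
    print(assigned, posterior_true_label(assigned, confusion, prior))
```

Under these assumptions a "P" from this worker always means the site is G, so the label carries full information despite being wrong on its face. An "R" is ambiguous between P and R, which is where the remaining uncertainty comes from.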
![Page 18: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/18.jpg)
Scaling Crowdsourcing: Use Machine Learning
Human labor is expensive, even when paying cents
Need to scale crowdsourcing
Basic idea: Build a machine learning model and use it instead of humans
[Diagram boxes: Data from existing crowdsourced answers; New Case; Automatic Model (through machine learning); Automatic Answer]
![Page 19: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/19.jpg)
Tradeoffs for Automatic Models: Effect of Noise
Get more data → Improve model accuracy
Improve data quality → Improve classification
Example Case: Porn or not?
[Figure: Accuracy (y-axis, 40 to 100) vs. number of examples (x-axis, 1 to 300, Mushroom data set), with one curve each for Data Quality = 50%, 60%, 80%, 100%]
![Page 20: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/20.jpg)
Scaling Crowdsourcing: Iterative training
Use machine when confident, humans otherwise
Retrain with new human input → improve model → reduce need for humans
[Diagram boxes: Data from existing crowdsourced answers; New Case; Automatic Model (through machine learning); Confident: Automatic Answer; Not confident: Get human(s) to answer]
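The routing rule on this slide can be sketched as a small function: use the model's answer when its top probability clears a threshold, otherwise ask humans and keep their answer as new training data. The `threshold` value, the function names, and the stub model and stub human below are all assumptions for illustration; retraining itself is left out.

```python
def route(case, predict_proba, ask_humans, threshold=0.9):
    """Return (label, source) for one case.

    predict_proba(case) returns a dict label -> probability.
    The model answers when its best probability reaches the threshold.
    """
    probs = predict_proba(case)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "model"
    return ask_humans(case), "human"


def process(cases, predict_proba, ask_humans, training_data, threshold=0.9):
    """Label all cases; human answers are appended to training_data
    so that the model can later be retrained on them."""
    answers = {}
    for case in cases:
        label, source = route(case, predict_proba, ask_humans, threshold)
        answers[case] = label
        if source == "human":
            training_data.append((case, label))
    return answers


# Stubs standing in for a trained model and for Mechanical Turk workers.
def stub_model(case):
    if "xxx" in case:
        return {"porn": 0.95, "not porn": 0.05}
    return {"porn": 0.55, "not porn": 0.45}


def stub_humans(case):
    return "not porn"


training_data = []
print(process(["xxx.example", "news.example"], stub_model, stub_humans, training_data))
print(training_data)
```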
![Page 21: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/21.jpg)
Tradeoffs for Automatic Models: Effect of Noise
Get more data → Improve model accuracy
Improve data quality → Improve classification
Example Case: Porn or not?
[Figure: Accuracy (y-axis, 40 to 100) vs. number of examples (x-axis, 1 to 300, Mushroom data set), with one curve each for Data Quality = 50%, 60%, 80%, 100%]
![Page 22: Example: Build an “Adult Web Site” Classifier](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56813c02550346895da56141/html5/thumbnails/22.jpg)
Scaling Crowdsourcing: Iterative training, with noise
Use machine when confident, humans otherwise
Ask as many humans as necessary to ensure quality
[Diagram boxes: Data from existing crowdsourced answers; New Case; Automatic Model (through machine learning); Confident: Automatic Answer; Not confident: Get human(s) to answer; Confident for quality?; Not confident for quality?]
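"Ask as many humans as necessary" can be sketched as a stopping rule: keep requesting votes until the leading answer is probable enough. The sketch assumes a yes/no question, independent workers who all share the same accuracy, and a uniform prior; `worker_accuracy`, `target`, and `max_votes` are illustrative parameters, not values from the slides.

```python
def posterior_yes(yes, no, worker_accuracy):
    """P(true answer is yes | votes so far).

    Assumes independent workers with equal accuracy and a uniform
    prior over the two answers.
    """
    like_yes = worker_accuracy**yes * (1 - worker_accuracy) ** no
    like_no = worker_accuracy**no * (1 - worker_accuracy) ** yes
    return like_yes / (like_yes + like_no)


def label_until_confident(next_vote, worker_accuracy=0.7, target=0.95, max_votes=15):
    """Request votes one at a time until the leading answer reaches the
    target probability or the vote budget runs out.

    next_vote() returns True for "yes" and False for "no".
    Returns (answer_is_yes, confidence, votes_used).
    """
    yes = no = 0
    p = 0.5
    while yes + no < max_votes:
        if next_vote():
            yes += 1
        else:
            no += 1
        p = posterior_yes(yes, no, worker_accuracy)
        if max(p, 1 - p) >= target:
            break
    return p >= 0.5, max(p, 1 - p), yes + no


# Example: every worker says "yes".
votes = iter([True] * 10)
print(label_until_confident(lambda: next(votes)))
```

With 70%-accurate workers who all agree, three votes give a confidence of about 0.927 and the fourth pushes it past 0.95, so the loop stops after four votes. Disagreement among workers makes it ask for more.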