2011 Crowdsourcing Search Evaluation
TRANSCRIPT
Crowdsourcing search relevance evaluation at eBay
Brian Johnson, September 28, 2011
Agenda
• Why
• What
• How
• Cost
• Quality
• Measurement
Why Ask Real Humans
• They're our customers
  – Sometimes asking is the best way to find out what you want to know
  – Provide ground truth for automated metrics
• Provide data for
  – Experimental Evaluation
    • complements A/B testing, surveys
  – Query Diagnosis
  – Judged Test Corpus
    • Machine Learning
    • Offline evaluation
  – Production Quality Control
Why Crowdsourcing
• Fast – 1-3 days
• Low Cost – pennies per judgment
• High Quality
  – Multiple workers
  – Worker evaluation (test questions & inter-worker agreement)
• Flexible – Ask anything
Judgment Volume by Day
Cost
| Judgments | Cost |
|---|---|
| 1 | $0.01 |
| 10 | $0.10 |
| 100 | $1.00 |
| 1,000 | $10.00 |
| 10,000 | $100.00 |
| 100,000 | $1,000.00 |
| 1,000,000 | $10,000.00 |
Who are these workers
• Crowdflower
  – Mechanical Turk
  – Gambit/Facebook
  – TrialPay
  – SamaSource
• LiveOps
• CloudCrowd
  – Facebook
What Can We Evaluate
• Search Ranking
  – Query > Item
• Item/Image Similarity
  – Item > Item
• Merchandising
  – Query > Item
  – Category > Item
  – Item > Item
• Product Tagging
  – Item > Product
• Category Recommendations
  – Item (Title) > Category
Crowdsourced Search Relevance Evaluation
• What are we measuring
  – Relevance
• What are we not measuring
  – Value
  – Purchase metrics
  – Revenue
Industry Standard Sample
• As in the original DCG formulation, we'll be using a four-point scale for relevance assessment:
• Irrelevant document (0)
• Marginally relevant document (1)
• Fairly relevant document (2)
• Highly relevant document (3)
http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf
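For use in the gain-based metrics later in the deck, a minimal sketch of mapping these graded labels to numeric gains; the linear 0-3 values follow the scale above, and any rescaling beyond that would be an assumption.

```python
# Label-to-gain mapping for the SIGIR-style four-point scale quoted above.
# The 0-3 values come from the scale itself; exponential or 0-1 variants would be assumptions.
GAIN = {
    "irrelevant": 0,
    "marginally relevant": 1,
    "fairly relevant": 2,
    "highly relevant": 3,
}

def judgment_to_gain(label: str) -> int:
    """Turn a worker's categorical judgment into the gain value used by (N)DCG."""
    return GAIN[label.strip().lower()]

print(judgment_to_gain("Highly relevant"))  # -> 3
```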
eBay Search Relevance Crowdsourcing
Great Match
Good Match
Not Matching
Quality
• Testing
  – Train/test workers before they start
  – Mix test questions into the work mix
  – Discard data from unreliable workers
• Redundancy
  – Cost is low, so ask multiple workers
  – Monitor inter-worker agreement
  – Have trusted workers monitor new workers
  – Track worker "feedback" over time (a sketch of these checks follows below)
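A minimal sketch of how the testing and redundancy checks above could be wired together; the threshold, field names, and majority-vote rule are illustrative assumptions, not eBay's or CrowdFlower's actual implementation.

```python
from collections import Counter

MIN_GOLD_ACCURACY = 0.7  # illustrative cut-off; platforms tune this per task

def gold_accuracy(worker_answers, gold):
    """Share of embedded test questions a worker answered correctly.
    Both arguments map question_id -> label."""
    scored = [q for q in worker_answers if q in gold]
    if not scored:
        return 0.0
    return sum(worker_answers[q] == gold[q] for q in scored) / len(scored)

def trusted_workers(answers_by_worker, gold):
    """Testing: keep workers who pass the test questions; discard data from the rest."""
    return {w for w, answers in answers_by_worker.items()
            if gold_accuracy(answers, gold) >= MIN_GOLD_ACCURACY}

def majority_vote(answers_by_worker, trusted):
    """Redundancy: combine multiple trusted workers' labels per question by majority vote."""
    votes = {}
    for w in trusted:
        for q, label in answers_by_worker[w].items():
            votes.setdefault(q, []).append(label)
    return {q: Counter(labels).most_common(1)[0][0] for q, labels in votes.items()}

gold = {"q1": "Great Match"}  # hypothetical test question with a known answer
answers = {"w1": {"q1": "Great Match", "q2": "Good Match"},
           "w2": {"q1": "Not Matching", "q2": "Good Match"}}
print(majority_vote(answers, trusted_workers(answers, gold)))  # {'q1': 'Great Match', 'q2': 'Good Match'}
```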
eBay @ SIGIR '10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald
The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers' results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.
SIGIR '10, July 19-23, 2010, Geneva, Switzerland
Metrics
• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
  – Precision & recall (binary relevance, no notion of position)
• Current metrics
  – Cumulative Gain (overall value of results on a non-binary relevance scale)
  – Discounted (adjusted for position value)
  – Normalized (common 0-1 scale)
Judgment Scale Granularity
• Binary: Irrelevant, Relevant
• Web Search: Offensive, Spam, Off Topic, Relevant, Useful, Vital
• SIGIR: Irrelevant, Marginally Relevant, Fairly Relevant, Highly Relevant
• 3 Point: Off Topic, Not Matching, Matching
• 4 Point: Spam, Off Topic, Not Matching, Good Match, Great Match
Rank Discount
[Chart: rank discount d = 1/r^constant, plotted for ranks 1-10 on a 0.0-1.0 scale]
Cumulative Gain Metrics
Columns (n = current rank):
• r – rank
• j – human judgment (0-1)
• cg – cumulative gain: cg(n) = cg(n-1) + j
• d – rank discount: d = 1 / r^c
• dcg – discounted cumulative gain: dcg(n) = dcg(n-1) + j * d
• io – ideal rank order, observed: sort(j)
• idcgo – ideal DCG, observed: idcgo(n) = idcgo(n-1) + io * d
• ndcgo – normalized DCG, observed: dcg(n) / idcgo(n)
• it – ideal rank order, theoretical: 1
• idcgt – ideal DCG, theoretical: idcgt(n) = idcgt(n-1) + it * d
• ndcgt – normalized DCG, theoretical: dcg(n) / idcgt(n)

| r | j | cg | d | dcg | io | idcgo | ndcgo | it | idcgt | ndcgt |
|---|---|----|---|-----|----|-------|-------|----|-------|-------|
| 1 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1.0 | 2.00 | 0.53 | 1.53 | 1.00 | 1.53 | 1.00 | 1.00 | 1.53 | 1.00 |
| 3 | 0.8 | 2.80 | 0.37 | 1.83 | 1.00 | 1.90 | 0.96 | 1.00 | 1.90 | 0.96 |
| 4 | 0.0 | 2.80 | 0.28 | 1.83 | 1.00 | 2.18 | 0.84 | 1.00 | 2.18 | 0.84 |
| 5 | 1.0 | 3.80 | 0.23 | 2.06 | 0.80 | 2.37 | 0.87 | 1.00 | 2.41 | 0.85 |
| 6 | 0.2 | 4.00 | 0.20 | 2.10 | 0.50 | 2.47 | 0.85 | 1.00 | 2.61 | 0.80 |
| 7 | 0.2 | 4.20 | 0.17 | 2.13 | 0.20 | 2.50 | 0.85 | 1.00 | 2.78 | 0.77 |
| 8 | 0.5 | 4.70 | 0.15 | 2.21 | 0.20 | 2.53 | 0.87 | 1.00 | 2.93 | 0.75 |
| 9 | 1.0 | 5.70 | 0.14 | 2.34 | 0.00 | 2.53 | 0.93 | 1.00 | 3.07 | 0.76 |
| 10 | 0.0 | 5.70 | 0.12 | 2.34 | 0.00 | 2.53 | 0.93 | 1.00 | 3.19 | 0.73 |
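A minimal sketch that reproduces the table above. The slide does not give the discount exponent; c ≈ 0.91 below is an assumption fitted to the discount column, and the judgment values are copied from the table.

```python
def rank_discount(r, c=0.91):
    """Rank discount d = 1 / r**c; the exponent is an assumed fit to the slide's values."""
    return 1.0 / r ** c

def dcg(judgments, c=0.91):
    """Running discounted cumulative gain: dcg(n) = dcg(n-1) + j * d."""
    total, out = 0.0, []
    for r, j in enumerate(judgments, start=1):
        total += j * rank_discount(r, c)
        out.append(total)
    return out

def ndcg(judgments, ideal_judgments, c=0.91):
    """Normalized DCG at each rank: observed DCG divided by an ideal DCG."""
    return [o / i for o, i in zip(dcg(judgments, c), dcg(ideal_judgments, c))]

j = [1.0, 1.0, 0.8, 0.0, 1.0, 0.2, 0.2, 0.5, 1.0, 0.0]  # human judgments by rank (column j)
observed_ideal = sorted(j, reverse=True)                  # io: best reordering of the returned items
theoretical_ideal = [1.0] * len(j)                        # it: a perfect item at every rank

print(ndcg(j, observed_ideal))     # ndcgo column, ~0.93 at rank 10
print(ndcg(j, theoretical_ideal))  # ndcgt column, ~0.73 at rank 10
```

The two normalizations differ only in the denominator: the best possible reordering of the items actually returned, versus a hypothetical result with a perfect item at every rank.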
Continuous Production Evaluation
• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis
[Chart: NDCG over time, by site, category, query, …]
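A minimal sketch of rolling the daily per-query NDCG sample up into trend lines; the record fields and values here are hypothetical, not eBay's schema.

```python
from collections import defaultdict
from statistics import mean

# One record per sampled query per day; field names and values are illustrative only.
records = [
    {"date": "2011-09-26", "site": "US", "category": "Shoes", "query": "nike", "ndcg": 0.81},
    {"date": "2011-09-26", "site": "UK", "category": "Toys",  "query": "lego", "ndcg": 0.74},
    {"date": "2011-09-27", "site": "US", "category": "Shoes", "query": "nike", "ndcg": 0.78},
]

def ndcg_trend(records, by=("date", "site")):
    """Average NDCG per (date, segment) group, ready to plot as a time series."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in by)].append(rec["ndcg"])
    return {key: mean(vals) for key, vals in groups.items()}

print(ndcg_trend(records))                            # trend by site
print(ndcg_trend(records, by=("date", "category")))   # trend by category
```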
Human Judgment > Query List
Best Match Variant Comparison
Best Match Variant Comparison
Measuring a Ranked List
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Ranking Evaluation
http://research.microsoft.com/en-us/um/people/kevynct/files/ECIR-2010-ML-Tutorial-FinalToPrint.pdf
NDCG - Example
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Open Questions
• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)
References
• Discounted Cumulative Gain – http://en.wikipedia.org/wiki/Discounted_cumulative_gain
• http://crowdflower.com/
• http://www.cloudcrowd.com/
• http://www.trialpay.com