2011 Crowdsourcing Search Evaluation
TRANSCRIPT
Crowdsourcing search relevance evaluation at eBay
Brian Johnson, September 28, 2011
Agenda
• Why
• What
• How
• Cost
• Quality
• Measurement
Why Ask Real Humans
• They're our customers
  – Sometimes asking is the best way to find out what you want to know
  – Provide ground truth for automated metrics
• Provide data for
  – Experimental Evaluation
    • complements A/B testing, surveys
  – Query Diagnosis
  – Judged Test Corpus
    • Machine Learning
    • Offline evaluation
  – Production Quality Control
Why Crowdsourcing
• Fast – 1-3 days
• Low Cost – pennies per judgment
• High Quality
  – Multiple workers
  – Worker evaluation (test questions & inter-worker agreement)
• Flexible – Ask anything
Judgment Volume by Day
Cost
| Judgments | Cost |
|---|---|
| 1 | $0.01 |
| 10 | $0.10 |
| 100 | $1.00 |
| 1,000 | $10.00 |
| 10,000 | $100.00 |
| 100,000 | $1,000.00 |
| 1,000,000 | $10,000.00 |
Who are these workers
• Crowdflower
  – Mechanical Turk
  – Gambit/Facebook
  – TrialPay
  – SamaSource
• LiveOps
• CloudCrowd
  – Facebook
What Can We Evaluate
• Search Ranking
  – Query > Item
• Item/Image Similarity
  – Item > Item
• Merchandising
  – Query > Item
  – Category > Item
  – Item > Item
• Product Tagging
  – Item > Product
• Category Recommendations
  – Item (Title) > Category
Crowdsourced Search Relevance Evaluation
• What are we measuring
  – Relevance
• What are we not measuring
  – Value
  – Purchase metrics
  – Revenue
Industry Standard Sample
• As in the original DCG formulation, we'll be using a four-point scale for relevance assessment:
• Irrelevant document (0)
• Marginally relevant document (1)
• Fairly relevant document (2)
• Highly relevant document (3)
http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf
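For use in the gain-based metrics later in the deck, a minimal sketch of mapping these graded labels to numeric gains; the linear 0-3 values follow the scale above, and any rescaling beyond that would be an assumption.

```python
# Label-to-gain mapping for the SIGIR-style four-point scale quoted above.
# The 0-3 values come from the scale itself; exponential or 0-1 variants would be assumptions.
GAIN = {
    "irrelevant": 0,
    "marginally relevant": 1,
    "fairly relevant": 2,
    "highly relevant": 3,
}

def judgment_to_gain(label: str) -> int:
    """Turn a worker's categorical judgment into the gain value used by (N)DCG."""
    return GAIN[label.strip().lower()]

print(judgment_to_gain("Highly relevant"))  # -> 3
```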
eBay Search Relevance Crowdsourcing
Great Match
Good Match
Not Matching
Quality
• Testing
  – Train/test workers before they start
  – Mix test questions into the work mix
  – Discard data from unreliable workers
• Redundancy
  – Cost is low, so ask multiple workers
  – Monitor inter-worker agreement
  – Have trusted workers monitor new workers
  – Track worker "feedback" over time (a sketch of these checks follows below)
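A minimal sketch of how the testing and redundancy checks above could be wired together; the threshold, field names, and majority-vote rule are illustrative assumptions, not eBay's or CrowdFlower's actual implementation.

```python
from collections import Counter

MIN_GOLD_ACCURACY = 0.7  # illustrative cut-off; platforms tune this per task

def gold_accuracy(worker_answers, gold):
    """Share of embedded test questions a worker answered correctly.
    Both arguments map question_id -> label."""
    scored = [q for q in worker_answers if q in gold]
    if not scored:
        return 0.0
    return sum(worker_answers[q] == gold[q] for q in scored) / len(scored)

def trusted_workers(answers_by_worker, gold):
    """Testing: keep workers who pass the test questions; discard data from the rest."""
    return {w for w, answers in answers_by_worker.items()
            if gold_accuracy(answers, gold) >= MIN_GOLD_ACCURACY}

def majority_vote(answers_by_worker, trusted):
    """Redundancy: combine multiple trusted workers' labels per question by majority vote."""
    votes = {}
    for w in trusted:
        for q, label in answers_by_worker[w].items():
            votes.setdefault(q, []).append(label)
    return {q: Counter(labels).most_common(1)[0][0] for q, labels in votes.items()}

gold = {"q1": "Great Match"}  # hypothetical test question with a known answer
answers = {"w1": {"q1": "Great Match", "q2": "Good Match"},
           "w2": {"q1": "Not Matching", "q2": "Good Match"}}
print(majority_vote(answers, trusted_workers(answers, gold)))  # {'q1': 'Great Match', 'q2': 'Good Match'}
```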
eBay @ SIGIR '10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald
The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers' results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.
SIGIR '10, July 19-23, 2010, Geneva, Switzerland
Metrics
• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
  – Precision & recall (binary relevance, no notion of position)
• Current metrics
  – Cumulative Gain (overall value of results on a non-binary relevance scale)
  – Discounted (adjusted for position value)
  – Normalized (common 0-1 scale)
Judgment Scale Granularity
• Binary: Irrelevant, Relevant
• Web Search: Offensive, Spam, Off Topic, Relevant, Useful, Vital
• SIGIR: Irrelevant, Marginally Relevant, Fairly Relevant, Highly Relevant
• 3 Point: Off Topic, Not Matching, Matching
• 4 Point: Spam, Off Topic, Not Matching, Good Match, Great Match
Rank Discount
[Chart: rank discount d = 1/r^constant, plotted for ranks 1-10 on a 0.0-1.0 scale]
Cumulative Gain Metrics
Columns (n = current rank):
• r – rank
• j – human judgment (0-1)
• cg – cumulative gain: cg(n) = cg(n-1) + j
• d – rank discount: d = 1 / r^c
• dcg – discounted cumulative gain: dcg(n) = dcg(n-1) + j * d
• io – ideal rank order, observed: sort(j)
• idcgo – ideal DCG, observed: idcgo(n) = idcgo(n-1) + io * d
• ndcgo – normalized DCG, observed: dcg(n) / idcgo(n)
• it – ideal rank order, theoretical: 1
• idcgt – ideal DCG, theoretical: idcgt(n) = idcgt(n-1) + it * d
• ndcgt – normalized DCG, theoretical: dcg(n) / idcgt(n)

| r | j | cg | d | dcg | io | idcgo | ndcgo | it | idcgt | ndcgt |
|---|---|----|---|-----|----|-------|-------|----|-------|-------|
| 1 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1.0 | 2.00 | 0.53 | 1.53 | 1.00 | 1.53 | 1.00 | 1.00 | 1.53 | 1.00 |
| 3 | 0.8 | 2.80 | 0.37 | 1.83 | 1.00 | 1.90 | 0.96 | 1.00 | 1.90 | 0.96 |
| 4 | 0.0 | 2.80 | 0.28 | 1.83 | 1.00 | 2.18 | 0.84 | 1.00 | 2.18 | 0.84 |
| 5 | 1.0 | 3.80 | 0.23 | 2.06 | 0.80 | 2.37 | 0.87 | 1.00 | 2.41 | 0.85 |
| 6 | 0.2 | 4.00 | 0.20 | 2.10 | 0.50 | 2.47 | 0.85 | 1.00 | 2.61 | 0.80 |
| 7 | 0.2 | 4.20 | 0.17 | 2.13 | 0.20 | 2.50 | 0.85 | 1.00 | 2.78 | 0.77 |
| 8 | 0.5 | 4.70 | 0.15 | 2.21 | 0.20 | 2.53 | 0.87 | 1.00 | 2.93 | 0.75 |
| 9 | 1.0 | 5.70 | 0.14 | 2.34 | 0.00 | 2.53 | 0.93 | 1.00 | 3.07 | 0.76 |
| 10 | 0.0 | 5.70 | 0.12 | 2.34 | 0.00 | 2.53 | 0.93 | 1.00 | 3.19 | 0.73 |
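A minimal sketch that reproduces the table above. The slide does not give the discount exponent; c ≈ 0.91 below is an assumption fitted to the discount column, and the judgment values are copied from the table.

```python
def rank_discount(r, c=0.91):
    """Rank discount d = 1 / r**c; the exponent is an assumed fit to the slide's values."""
    return 1.0 / r ** c

def dcg(judgments, c=0.91):
    """Running discounted cumulative gain: dcg(n) = dcg(n-1) + j * d."""
    total, out = 0.0, []
    for r, j in enumerate(judgments, start=1):
        total += j * rank_discount(r, c)
        out.append(total)
    return out

def ndcg(judgments, ideal_judgments, c=0.91):
    """Normalized DCG at each rank: observed DCG divided by an ideal DCG."""
    return [o / i for o, i in zip(dcg(judgments, c), dcg(ideal_judgments, c))]

j = [1.0, 1.0, 0.8, 0.0, 1.0, 0.2, 0.2, 0.5, 1.0, 0.0]  # human judgments by rank (column j)
observed_ideal = sorted(j, reverse=True)                  # io: best reordering of the returned items
theoretical_ideal = [1.0] * len(j)                        # it: a perfect item at every rank

print(ndcg(j, observed_ideal))     # ndcgo column, ~0.93 at rank 10
print(ndcg(j, theoretical_ideal))  # ndcgt column, ~0.73 at rank 10
```

The two normalizations differ only in the denominator: the best possible reordering of the items actually returned, versus a hypothetical result with a perfect item at every rank.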
Continuous Production Evaluation
• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis
[Chart: NDCG over time, by site, category, query, …]
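A minimal sketch of rolling the daily per-query NDCG sample up into trend lines; the record fields and values here are hypothetical, not eBay's schema.

```python
from collections import defaultdict
from statistics import mean

# One record per sampled query per day; field names and values are illustrative only.
records = [
    {"date": "2011-09-26", "site": "US", "category": "Shoes", "query": "nike", "ndcg": 0.81},
    {"date": "2011-09-26", "site": "UK", "category": "Toys",  "query": "lego", "ndcg": 0.74},
    {"date": "2011-09-27", "site": "US", "category": "Shoes", "query": "nike", "ndcg": 0.78},
]

def ndcg_trend(records, by=("date", "site")):
    """Average NDCG per (date, segment) group, ready to plot as a time series."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in by)].append(rec["ndcg"])
    return {key: mean(vals) for key, vals in groups.items()}

print(ndcg_trend(records))                            # trend by site
print(ndcg_trend(records, by=("date", "category")))   # trend by category
```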
Human Judgment > Query List
Best Match Variant Comparison
Best Match Variant Comparison
Measuring a Ranked List
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Ranking Evaluation
http://research.microsoft.com/en-us/um/people/kevynct/files/ECIR-2010-ML-Tutorial-FinalToPrint.pdf
NDCG - Example
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Open Questions
• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)
References
• Discounted Cumulative Gain – http://en.wikipedia.org/wiki/Discounted_cumulative_gain
• http://crowdflower.com/
• http://www.cloudcrowd.com/
• http://www.trialpay.com