
Page 1: Did you mean crowdsourcing for recommender systems?

Did you mean crowdsourcing for recommender systems?

OMAR ALONSO

6-OC T-2014

CROWDREC 2014

Page 2: Did you mean crowdsourcing for recommender systems?

Disclaimer

The views and opinions expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.


Page 3: Did you mean crowdsourcing for recommender systems?

Outline

A bit on human computation

Crowdsourcing in information retrieval

Opportunities for recommender systems


Page 4: Did you mean crowdsourcing for recommender systems?

Human Computation


Page 5: Did you mean crowdsourcing for recommender systems?

Human Computation

You are a computer


Page 6: Did you mean crowdsourcing for recommender systems?

Human-based computation

Use humans as processors in a distributed system

Address problems that computers aren’t good at

Games with a purpose

Examples

◦ ESP Game

◦ CAPTCHA

◦ reCAPTCHA


Page 7: Did you mean crowdsourcing for recommender systems?

Some definitions

Human computation is computation performed by a human

A human computation system is a system that organizes human effort to carry out computation

Crowdsourcing is a tool that a human computation system can use to distribute tasks.


Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.

Page 8: Did you mean crowdsourcing for recommender systems?

HC at the core of RecSys

“In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” – Resnick and Varian (CACM 1997)

S. Perugini, M. Gonçalves, E. Fox: Recommender Systems Research: A Connection-Centric Survey. J. Intell. Inf. Syst. 23(2): 107-143 (2004)


Page 9: Did you mean crowdsourcing for recommender systems?

{where to go on vacation}

MTurk: 50 answers, $1.80

Quora: 2 answers

Y! Answers: 2 answers

FB: 1 answer

Tons of results

Read title + snippet + URL

Explore a few pages in detail


Page 10: Did you mean crowdsourcing for recommender systems?

{where to go on vacation}

[Charts of the answers grouped by Countries and Cities]


Page 11: Did you mean crowdsourcing for recommender systems?

Information Retrieval and Crowdsourcing


Page 12: Did you mean crowdsourcing for recommender systems?

The rise of crowdsourcing in IR

Crowdsourcing is hot

Lots of interest in the research community

◦ Articles showing good results

◦ Journal special issues (IR, IEEE Internet Computing, etc.)

◦ Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, VLDB, RecSys, CHI, etc.)

◦ HCOMP

◦ CrowdConf

Large companies using crowdsourcing

Big data

Start-ups

Venture capital investment


Page 13: Did you mean crowdsourcing for recommender systems?

Why is this interesting?

Easy to prototype and test new experiments

Cheap and fast

No need to set up infrastructure

Introduce experimentation early in the cycle

In the context of IR, implement and experiment as you go

For new ideas, this is very helpful


Page 14: Did you mean crowdsourcing for recommender systems?

Caveats and clarifications

Trust and reliability

Wisdom of the crowd, revisited

Adjust expectations

Crowdsourcing is another data point for your analysis

Complementary to other experiments


Page 15: Did you mean crowdsourcing for recommender systems?

Why now?

The Web

Use humans as processors in a distributed system

Address problems that computers aren’t good at

Scale

Reach


Page 16: Did you mean crowdsourcing for recommender systems?

Motivating example: relevance judging

Relevance of search results is difficult to judge

◦ Highly subjective

◦ Expensive to measure

Professional editors commonly used

Potential benefits of crowdsourcing

◦ Scalability (time and cost)

◦ Diversity of judgments


Matt Lease and Omar Alonso. “Crowdsourcing for search evaluation and social-algorithmic search”, ACM SIGIR 2012 Tutorial.

Page 17: Did you mean crowdsourcing for recommender systems?

Crowdsourcing and relevance evaluation

For relevance, it combines two main approaches

◦ Explicit judgments

◦ Automated metrics

Other features

◦ Large scale

◦ Inexpensive

◦ Diversity
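
To make the combination concrete, here is a minimal sketch (not from the talk): redundant explicit judgments from the crowd are averaged per query-document pair and then fed into an automated metric such as nDCG. The judgments, the 0-2 grading scale, and the mean aggregation are illustrative assumptions.

```python
import math
from collections import defaultdict

# Hypothetical crowd judgments: (query, doc, worker, grade on a 0-2 scale).
judgments = [
    ("vacation", "d1", "w1", 2), ("vacation", "d1", "w2", 2), ("vacation", "d1", "w3", 1),
    ("vacation", "d2", "w1", 0), ("vacation", "d2", "w2", 1), ("vacation", "d2", "w3", 0),
    ("vacation", "d3", "w1", 2), ("vacation", "d3", "w2", 1), ("vacation", "d3", "w3", 2),
]

# Explicit judgments: aggregate the redundant labels per (query, doc) by their mean.
grades = defaultdict(list)
for query, doc, _worker, grade in judgments:
    grades[(query, doc)].append(grade)
qrels = {key: sum(g) / len(g) for key, g in grades.items()}

# Automated metric: DCG / nDCG computed from the aggregated labels.
def dcg(gains):
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(query, ranking, k=10):
    gains = [qrels.get((query, doc), 0.0) for doc in ranking[:k]]
    ideal = sorted((g for (q, _d), g in qrels.items() if q == query), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Score a (made-up) ranked list produced by the system under evaluation.
print(ndcg_at_k("vacation", ["d3", "d1", "d2"]))
```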


Page 18: Did you mean crowdsourcing for recommender systems?

Development framework

Incremental approach

Measure, evaluate, and adjust as you go

Suitable for repeatable tasks


O. Alonso. “Implementing crowdsourcing-based relevance experimentation: an industrial perspective”. Information Retrieval, 16(2), 2013.

Page 19: Did you mean crowdsourcing for recommender systems?

Asking questions

Ask the right questions

Part art, part science

Instructions are key

Workers may not be IR experts, so don’t assume they share your terminology

Show examples

Hire a technical writer

◦ Engineer writes the specification

◦ Writer communicates


N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.

Page 20: Did you mean crowdsourcing for recommender systems?

UX design

Time to apply all those usability concepts

Experiment should be self-contained.

Keep it short and simple. Brief and concise.

Be very clear with the relevance task.

Engage with the worker. Avoid boring stuff.

Document presentation & design

Need to grab attention

Always ask for feedback (open-ended question) in an input box.

Localization


Page 21: Did you mean crowdsourcing for recommender systems?

Other design principles

Text alignment

Legibility

Reading level: complexity of words and sentences

Attractiveness (worker’s attention & enjoyment)

Multi-cultural / multi-lingual

Who is the audience (e.g. target worker community)

Special needs communities (e.g. simple color blindness)

Cognitive load: mental rigor needed to perform task

Exposure effect


Page 22: Did you mean crowdsourcing for recommender systems?

When to assess work quality?

Beforehand (prior to main task activity)

◦ How: “qualification tests” or similar mechanism

◦ Purpose: screening, selection, recruiting, training

During

◦ How: assess labels as worker produces them

◦ Like random checks on a manufacturing line

◦ Purpose: calibrate, reward/penalize, weight

After

◦ How: compute accuracy metrics post-hoc

◦ Purpose: filter, calibrate, weight, retain (HR)


Page 23: Did you mean crowdsourcing for recommender systems?

How do we measure work quality?

Compare worker’s label vs.

◦ Known (correct, trusted) label

◦ Other workers’ labels

◦ Model predictions of workers and labels

Verify worker’s label

◦ Yourself

◦ Tiered approach (e.g. Find-Fix-Verify)


Page 24: Did you mean crowdsourcing for recommender systems?

Comparing to known answers

AKA: gold, honey pot, verifiable answer, trap

Assumes you have known answers

Cost vs. Benefit

◦ Producing known answers (experts?)

◦ % of work spent re-producing them

Finer points

◦ What if workers recognize the honey pots?
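
A minimal sketch of the known-answer check, assuming worker answers and a small gold set have already been collected as plain dictionaries; the item ids, labels, and the 80% accuracy threshold are arbitrary examples, not figures from the talk.

```python
# Known (gold / honey-pot) labels embedded among the regular items.
gold = {"item_07": "relevant", "item_19": "not_relevant", "item_23": "relevant"}

# answers maps worker -> {item_id: label} for everything they judged.
answers = {
    "w1": {"item_01": "relevant", "item_07": "relevant", "item_19": "not_relevant"},
    "w2": {"item_07": "not_relevant", "item_19": "not_relevant", "item_23": "not_relevant"},
}

def gold_accuracy(worker_answers):
    """Fraction of gold items this worker labeled correctly (None if they saw none)."""
    seen = [item for item in worker_answers if item in gold]
    if not seen:
        return None
    correct = sum(worker_answers[item] == gold[item] for item in seen)
    return correct / len(seen)

THRESHOLD = 0.8  # arbitrary cut-off for this sketch
for worker, worker_answers in answers.items():
    acc = gold_accuracy(worker_answers)
    flagged = acc is not None and acc < THRESHOLD
    print(worker, acc, "flagged" if flagged else "ok")
```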


Page 25: Did you mean crowdsourcing for recommender systems?

Comparing to other workers

AKA: consensus, plurality, redundant labeling

Well-known metrics for measuring agreement

Cost vs. Benefit: % of work that is redundant

Finer points

◦ Is consensus “truth” or systematic bias of the group?

◦ What if no one really knows what they’re doing?

◦ Low agreement across workers indicates the problem is with the task (or a specific example), not the workers


Page 26: Did you mean crowdsourcing for recommender systems?

Methods for measuring agreement

What to look for

◦ Agreement, reliability, validity

Inter-rater agreement level

◦ Agreement between judges

◦ Agreement between judges and the gold set

Some statistics

◦ Percentage agreement

◦ Cohen’s kappa (2 raters)

◦ Fleiss’ kappa (any number of raters)

◦ Krippendorff’s alpha

With majority vote, what if 2 say relevant and 3 say not?

◦ Use an expert to break ties

◦ Collect more judgments as needed to reduce uncertainty
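
A sketch of two of these statistics and of the weak-majority case, using made-up labels for five items judged by five workers each; the two-category scale and the margin rule are assumptions. For two raters, Cohen’s kappa is also available as sklearn.metrics.cohen_kappa_score.

```python
# ratings[item] = labels from the redundant workers (5 per item here).
ratings = {
    "t1": ["rel", "rel", "rel", "rel", "not"],
    "t2": ["rel", "rel", "not", "not", "not"],   # the weak 2-vs-3 split
    "t3": ["not", "not", "not", "not", "not"],
    "t4": ["rel", "rel", "rel", "not", "not"],
    "t5": ["rel", "not", "not", "not", "not"],
}
CATEGORIES = ["rel", "not"]

def percentage_agreement(ratings):
    """Mean fraction of agreeing rater pairs per item."""
    total = 0.0
    for labels in ratings.values():
        n = len(labels)
        pairs = n * (n - 1) / 2
        agree = sum(labels[i] == labels[j] for i in range(n) for j in range(i + 1, n))
        total += agree / pairs
    return total / len(ratings)

def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of raters per item."""
    n = len(next(iter(ratings.values())))                       # raters per item
    counts = [[labels.count(c) for c in CATEGORIES] for labels in ratings.values()]
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / len(p_i)
    p_j = [sum(row[j] for row in counts) / (len(counts) * n) for j in range(len(CATEGORIES))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

def majority_vote(labels, min_margin=2):
    """Majority label, or None when the margin is too small
    (send the item to an expert or collect more judgments)."""
    ranked = sorted(CATEGORIES, key=labels.count, reverse=True)
    if labels.count(ranked[0]) - labels.count(ranked[1]) < min_margin:
        return None
    return ranked[0]

print("percentage agreement:", percentage_agreement(ratings))
print("Fleiss' kappa:", fleiss_kappa(ratings))
print({item: majority_vote(labels) for item, labels in ratings.items()})
```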


Page 27: Did you mean crowdsourcing for recommender systems?

Pause

Crowdsourcing works

◦ Fast turnaround, easy to experiment, few dollars to test

◦ But: you have to design experiments carefully, manage quality, and work within platform limitations

Crowdsourcing in production

◦ Large scale data sets (millions of labels)

◦ Continuous execution

◦ Difficult to debug

Multiple contingent factors

How do you know the experiment is working?

Goal: framework for ensuring reliability on crowdsourcing tasks


O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.

Page 28: Did you mean crowdsourcing for recommender systems?

Labeling tweets – an example of a task

Is this tweet interesting?

Subjective activity

Not focused on specific events

Findings

◦ Difficult problem, low inter-rater agreement (Fleiss’ kappa, Krippendorff’s alpha)

◦ Tested many designs, number of workers, platforms (MTurk and others)

Multiple contingent factors

◦ Worker performance

◦ Work

◦ Task design


O. Alonso, C. Marshall and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.

Page 29: Did you mean crowdsourcing for recommender systems?

Designs that include in-task CAPTCHA

Borrowed idea from reCAPTCHA -> use of a control term

Adapt your labeling task

2 more questions as controls

◦ 1 algorithmic

◦ 1 semantic
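
A sketch of how the two control questions might gate the main judgment before aggregation; the field names, the example controls, and the strict pass/fail rule are hypothetical, not the schema of the production tasks shown next.

```python
# Each response carries the main label plus the two control answers.
responses = [
    {"worker": "w1", "tweet_id": "t1", "interesting": "yes",
     "control_algorithmic": "7", "control_semantic": "sports"},
    {"worker": "w2", "tweet_id": "t1", "interesting": "no",
     "control_algorithmic": "5", "control_semantic": "sports"},
]

# Expected control answers per item (e.g. a small sum shown in the task and a
# question about the tweet's topic).
expected = {"t1": {"control_algorithmic": "7", "control_semantic": "sports"}}

def passes_controls(response):
    """True only if both in-task controls were answered correctly."""
    exp = expected[response["tweet_id"]]
    return (response["control_algorithmic"] == exp["control_algorithmic"]
            and response["control_semantic"] == exp["control_semantic"])

# Keep only judgments whose controls were passed, then aggregate the main
# question as usual.
clean = [r for r in responses if passes_controls(r)]
print(len(clean), "of", len(responses), "judgments kept")
```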


Page 30: Did you mean crowdsourcing for recommender systems?

Production example #1


[Annotated screenshot of the task: the tweet de-branded, the in-task captcha, and the main question]

Q1 (k = 0.91, alpha = 0.91)

Q2 (k = 0.771, alpha = 0.771)

Q3 (k = 0.033, alpha = 0.035)

Page 31: Did you mean crowdsourcing for recommender systems?

Production example #2


[Annotated screenshot of the task: the tweet de-branded, the in-task captcha, and the main question broken down by categories to get a better signal]

Q1 (k = 0.907, alpha = 0.907)

Q2 (k = 0.728, alpha = 0.728)

Q3 breakdown by category:

• Worthless (alpha = 0.033)
• Trivial (alpha = 0.043)
• Funny (alpha = -0.016)
• Makes me curious (alpha = 0.026)
• Contains useful info (alpha = 0.048)
• Important news (alpha = 0.207)

Page 32: Did you mean crowdsourcing for recommender systems?

Findings from designs

No quality control issues

Eliminating workers who did a poor job on question #1 didn’t affect inter-rater agreement for questions #2 and #3.

Interestingness is a fully subjective notion

We can still build a classifier that identifies tweets that are interesting to a majority of users
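
As a sketch of that last point, assuming the majority-vote "interesting" labels are already computed, a plain text classifier can be trained on them; the tweets, labels, and feature choice below are made up and only illustrate the idea, not the actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Major earthquake reported off the coast this morning",
    "just had a sandwich lol",
    "New open-source release of our recommender toolkit",
    "so bored right now",
]
# 1 = a majority of workers found the tweet interesting, 0 = otherwise.
labels = [1, 0, 1, 0]

# Bag-of-words model trained on the crowd's aggregated labels.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tweets, labels)

print(model.predict(["breaking: new earthquake data released"]))
```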


Page 33: Did you mean crowdsourcing for recommender systems?

Careful with That Axe Data, Eugene

In the area of big data and machine learning:

◦ labels -> features -> predictive model -> optimization

Labeling/experimentation perceived as boring

Don’t rush labeling

◦ Human and machine

Label quality is very important

◦ Don’t outsource it

◦ Own it end to end

◦ Large scale


Page 34: Did you mean crowdsourcing for recommender systems?

More on label quality

Data gathering is not a free lunch

You can’t outsource label acquisition and quality

Labels for the machine != labels for humans

Emphasis on algorithms, models/optimizations and mining from labels

Not so much on algorithms for ensuring high quality labels

Training sets


Page 35: Did you mean crowdsourcing for recommender systems?

People are more than HPUs

Why is Facebook popular? People are social.

Information needs are contextually grounded in our social experiences and social networks

Our social networks also embody additional knowledge about us, our needs, and the world

We relate to recommendations

The social dimension complements computation


Page 36: Did you mean crowdsourcing for recommender systems?

Opportunities in RecSys


Page 37: Did you mean crowdsourcing for recommender systems?

Humans in the loop

Computation loops that mix humans and machines

A kind of active learning

Double goal:

◦ Human checking on the machine

◦ Machine checking on humans

Example: classifiers for social data
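
A minimal sketch of the "human checking on the machine" half of such a loop, assuming a text classifier and a placeholder ask_crowd function standing in for a real crowdsourcing call; the 0.7 confidence threshold and the features are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great concert tonight", "server outage in region x",
         "lunch was ok", "new security patch released"]
labels = [0, 1, 0, 1]                     # 1 = class of interest, 0 = not

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def ask_crowd(text):
    # Placeholder: in production this would create a crowdsourcing task
    # and aggregate the redundant answers.
    return 1

for text in ["possible outage?", "had coffee"]:
    confidence = max(model.predict_proba([text])[0])
    if confidence < 0.7:                  # machine defers to the humans
        label = ask_crowd(text)           # humans check on the machine
        texts.append(text)
        labels.append(label)
        model.fit(texts, labels)          # machine learns from the humans
    else:
        label = int(model.predict([text])[0])
    print(text, "->", label)
```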


Page 38: Did you mean crowdsourcing for recommender systems?

Collaborative Filtering v2

Collaboration with recipients

Interactive

Learning new data
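
One way to read these bullets is a feedback loop in which recipients' reactions to recommendations are written back into the data the recommender learns from. The sketch below is a generic user-based collaborative filter over made-up ratings, not the "v2" design the slide has in mind.

```python
import math
from collections import defaultdict

ratings = defaultdict(dict, {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m4": 5},
    "carol": {"m2": 5, "m3": 4, "m4": 4, "m5": 3},
})

def similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (math.sqrt(sum(ratings[u][i] ** 2 for i in common))
           * math.sqrt(sum(ratings[v][i] ** 2 for i in common)))
    return num / den if den else 0.0

def recommend(user, k=2):
    """Items the user has not rated, scored by similarity-weighted ratings."""
    scores = defaultdict(float)
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] += sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))        # -> ['m4', 'm5']
ratings["alice"]["m4"] = 2       # the recipient reacts to a recommendation
print(recommend("alice"))        # -> ['m5']: the new data changes the output
```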


Page 39: Did you mean crowdsourcing for recommender systems?

What’s in a label?

Clicks, reviews, ratings, etc.

Better or novel systems if we focus more on label quality?

New ways of collecting data

Training sets

Evaluation & measurement


Page 40: Did you mean crowdsourcing for recommender systems?

Routing

Expertise detection and routing

Social load balancing

When to switch between machines and humans
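
A sketch of expertise-based routing with a simple per-person load cap and a machine fallback; the expertise scores, capacities, and the minimum-expertise threshold are all assumptions for illustration.

```python
expertise = {
    "ann": {"travel": 0.9, "ml": 0.2},
    "ben": {"travel": 0.4, "ml": 0.8},
}
capacity = {"ann": 2, "ben": 2}          # max open questions per person
load = {"ann": 0, "ben": 0}

def route(topic, min_expertise=0.5):
    """Pick the most expert available human, or fall back to the machine."""
    candidates = [(expertise[p].get(topic, 0.0), p) for p in expertise
                  if load[p] < capacity[p]]
    candidates = [c for c in candidates if c[0] >= min_expertise]
    if not candidates:
        return "machine"                 # switch from humans to the machine
    _score, person = max(candidates)
    load[person] += 1
    return person

for topic in ["travel", "travel", "travel", "ml", "quantum"]:
    print(topic, "->", route(topic))
```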


Page 41: Did you mean crowdsourcing for recommender systems?

Conclusions

Crowdsourcing at scale works but requires a solid framework

Three aspects that need attention: workers, work and task design

Labeling social data is hard

Traditional IR approaches don’t seem to work for Twitter data

Label quality

Outlined areas where RecSys can benefit from crowdsourcing


Page 42: Did you mean crowdsourcing for recommender systems?

Thank you


@elunca