Did you mean crowdsourcing for recommender systems?
OMAR ALONSO
6-OCT-2014
CROWDREC 2014
Disclaimer
The views and opinions expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
Outline
A bit on human computation
Crowdsourcing in information retrieval
Opportunities for recommender systems
Human Computation
Human Computation
You are a computer
Human-based computation
Use humans as processors in a distributed system
Address problems that computers aren’t good at
Games with a purpose
Examples
◦ ESP game
◦ CAPTCHA
◦ reCAPTCHA
Some definitions
Human computation is a computation that is performed by a human.
A human computation system is a system that organizes human efforts to carry out computation.
Crowdsourcing is a tool that a human computation system can use to distribute tasks.
Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
HC at the core of RecSys
“In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” – Resnick and Varian (CACM 1997)
S. Perugini, M. Gonçalves, E. Fox: Recommender Systems Research: A Connection-Centric Survey. J. Intell. Inf. Syst. 23(2): 107-143 (2004)
{where to go on vacation}
MTurk: 50 answers, $1.80
Quora: 2 answers
Y! Answers: 2 answers
FB: 1 answer
Tons of results
Read title + snippet + URL
Explore a few pages in detail
{where to go on vacation}
[Crowd answers grouped into two lists: countries and cities]
Information Retrieval and Crowdsourcing
The rise of crowdsourcing in IR
Crowdsourcing is hot
Lots of interest in the research community
◦ Articles showing good results
◦ Journal special issues (IR, IEEE Internet Computing, etc.)
◦ Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, VLDB, RecSys, CHI, etc.)
◦ HCOMP
◦ CrowdConf
Large companies using crowdsourcing
Big data
Start-ups
Venture capital investment
Why is this interesting?
Easy to prototype and test new experiments
Cheap and fast
No need to set up infrastructure
Introduce experimentation early in the cycle
In the context of IR, implement and experiment as you go
For new ideas, this is very helpful
Caveats and clarifications
Trust and reliability
The wisdom of the crowd, revisited
Adjust expectations
Crowdsourcing is another data point for your analysis
Complementary to other experiments
Why now?
The Web
Use humans as processors in a distributed system
Address problems that computers aren’t good at
Scale
Reach
Motivating example: relevance judging
Relevance of search results is difficult to judge
◦ Highly subjective
◦ Expensive to measure
Professional editors commonly used
Potential benefits of crowdsourcing
◦ Scalability (time and cost)
◦ Diversity of judgments
Matt Lease and Omar Alonso. “Crowdsourcing for search evaluation and social-algorithmic search”, ACM SIGIR 2012 Tutorial.
Crowdsourcing and relevance evaluation
For relevance, it combines two main approaches
◦ Explicit judgments
◦ Automated metrics
Other features
◦ Large scale
◦ Inexpensive
◦ Diversity
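In this combination the explicit crowd judgments typically become the ground truth that an automated metric consumes. A minimal Python sketch of that hand-off, computing precision at k from aggregated crowd labels; all names and values are illustrative, not from the talk:

```python
# Minimal sketch: feed aggregated crowd relevance judgments into an
# automated metric (precision at k). All names here are illustrative.

def precision_at_k(ranked_doc_ids, judgments, k):
    """judgments maps doc_id -> True/False (the aggregated crowd label)."""
    top_k = ranked_doc_ids[:k]
    relevant = sum(1 for doc in top_k if judgments.get(doc, False))
    return relevant / k

# Aggregated crowd labels for one query's results.
crowd_judgments = {"d1": True, "d2": False, "d3": True}
print(precision_at_k(["d1", "d2", "d3"], crowd_judgments, k=3))  # ~0.67
```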
Development framework
Incremental approach
Measure, evaluate, and adjust as you go
Suitable for repeatable tasks
O. Alonso. “Implementing crowdsourcing-based relevance experimentation: an industrial perspective”. Information Retrieval, 16(2), 2013.
Asking questions
Ask the right questions
Part art, part science
Instructions are key
Workers may not be IR experts, so don’t assume they share your understanding of the terminology
Show examples
Hire a technical writer
◦ The engineer writes the specification
◦ The writer communicates it
N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
UX design
Time to apply all those usability concepts
Experiment should be self-contained.
Keep it short and simple: brief and concise.
Be very clear with the relevance task.
Engage with the worker. Avoid boring stuff.
Document presentation & design
Need to grab attention
Always ask for feedback (open-ended question) in an input box.
Localization
Other design principles
Text alignment
Legibility
Reading level: complexity of words and sentences
Attractiveness (worker’s attention & enjoyment)
Multi-cultural / multi-lingual
Who is the audience (e.g. target worker community)
Special needs communities (e.g. simple color blindness)
Cognitive load: mental rigor needed to perform task
Exposure effect
When to assess work quality?
Beforehand (prior to main task activity)
◦ How: “qualification tests” or similar mechanism
◦ Purpose: screening, selection, recruiting, training
During
◦ How: assess labels as the worker produces them
◦ Like random checks on a manufacturing line
◦ Purpose: calibrate, reward/penalize, weight
After
◦ How: compute accuracy metrics post hoc
◦ Purpose: filter, calibrate, weight, retain (HR)
How do we measure work quality?
Compare worker’s label vs.
◦ Known (correct, trusted) label
◦ Other workers’ labels
◦ Model predictions of workers and labels
Verify worker’s label
◦ Yourself
◦ Tiered approach (e.g. Find-Fix-Verify)
Comparing to known answers
AKA: gold, honey pot, verifiable answer, trap
Assumes you have known answers
Cost vs. benefit
◦ Producing known answers (experts?)
◦ % of work spent re-producing them
Finer points
◦ What if workers recognize the honey pots?
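As a concrete illustration, a minimal Python sketch of honey-pot scoring: grade each worker on the items whose correct label is known, and flag low scorers. The field names and the 0.8 threshold are assumptions for the example, not values from the talk:

```python
# Minimal sketch of honey-pot (gold) scoring.

def gold_accuracy(worker_labels, gold_labels):
    """Both arguments map item_id -> label; returns accuracy on shared items."""
    shared = [item for item in worker_labels if item in gold_labels]
    if not shared:
        return None  # the worker has not seen any honey pots yet
    correct = sum(worker_labels[i] == gold_labels[i] for i in shared)
    return correct / len(shared)

gold = {"doc7": "relevant", "doc42": "not relevant"}
worker = {"doc7": "relevant", "doc42": "relevant", "doc99": "relevant"}
accuracy = gold_accuracy(worker, gold)        # 0.5
if accuracy is not None and accuracy < 0.8:   # illustrative threshold
    print("flag this worker for review")
```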
Comparing to other workers
AKA: consensus, plurality, redundant labeling
Well-known metrics for measuring agreement
Cost vs. benefit: % of work that is redundant
Finer points
◦ Is consensus “truth” or the systematic bias of a group?
◦ What if no one really knows what they’re doing?
◦ Low agreement across workers indicates the problem is with the task (or a specific example), not the workers
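A minimal Python sketch of the consensus approach, aggregating redundant labels by majority vote; the tie handling follows the expert-adjudication idea discussed under agreement methods below:

```python
# Minimal sketch of aggregating redundant labels by majority vote.
from collections import Counter

def majority_vote(labels):
    """Return the winning label, or None on a tie (route to an expert)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: break with an expert or collect more judgments
    return counts[0][0]

print(majority_vote(["rel", "rel", "not rel", "rel", "not rel"]))  # rel
print(majority_vote(["rel", "not rel"]))  # None -> needs adjudication
```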
Methods for measuring agreement
What to look for
◦ Agreement, reliability, validity
Inter-agreement level
◦ Agreement between judges
◦ Agreement between judges and the gold set
Some statistics
◦ Percentage agreement
◦ Cohen’s kappa (2 raters)
◦ Fleiss’ kappa (any number of raters)
◦ Krippendorff’s alpha
With majority vote, what if 2 say relevant and 3 say not?
◦ Use an expert to break ties
◦ Collect more judgments as needed to reduce uncertainty
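For two raters, here is a minimal self-contained Python sketch of the first two statistics, percentage agreement and Cohen’s kappa; the example labels are illustrative:

```python
# Minimal sketch: percentage agreement and Cohen's kappa for two raters.
from collections import Counter

def percentage_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters chose the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percentage_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters independently pick the same label.
    p_chance = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    return (p_observed - p_chance) / (1 - p_chance)

a = [1, 1, 0, 1, 0]  # rater A: relevant (1) / not relevant (0)
b = [1, 0, 0, 1, 0]  # rater B
print(percentage_agreement(a, b))  # 0.8
print(cohens_kappa(a, b))          # ~0.615
```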
Pause
Crowdsourcing works
◦ Fast turnaround, easy to experiment, a few dollars to test
◦ But: you have to design experiments carefully, manage quality, and work within platform limitations
Crowdsourcing in production
◦ Large-scale data sets (millions of labels)
◦ Continuous execution
◦ Difficult to debug
Multiple contingent factors
How do you know the experiment is working?
Goal: a framework for ensuring the reliability of crowdsourcing tasks
O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.
Labeling tweets – an example of a task
Is this tweet interesting?
Subjective activity
Not focused on specific events
Findings
◦ Difficult problem, low inter-rater agreement (Fleiss’ kappa, Krippendorff’s alpha)
◦ Tested many designs, numbers of workers, and platforms (MTurk and others)
Multiple contingent factors
◦ Worker performance
◦ Work
◦ Task design
O. Alonso, C. Marshall and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
Designs that include an in-task CAPTCHA
Borrowed the idea from reCAPTCHA -> use of a control term
Adapt your labeling task
2 more questions as controls
◦ 1 algorithmic
◦ 1 semantic
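A minimal Python sketch of how such control questions might be used to filter responses before analyzing the main label; the field names and expected answers are assumptions for illustration:

```python
# Minimal sketch: keep only responses that pass both control questions
# (one algorithmic, one semantic) before analyzing the main label.

def passes_controls(response, expected):
    return (response["algorithmic_answer"] == expected["algorithmic"]
            and response["semantic_answer"] == expected["semantic"])

expected = {"algorithmic": "12", "semantic": "weather"}
responses = [
    {"worker": "w1", "label": "interesting",
     "algorithmic_answer": "12", "semantic_answer": "weather"},
    {"worker": "w2", "label": "interesting",
     "algorithmic_answer": "12", "semantic_answer": "sports"},
]
kept = [r for r in responses if passes_controls(r, expected)]
print([r["worker"] for r in kept])  # only 'w1' survives
```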
Production example #1
[Task screenshot: an in-task captcha question, a de-branded tweet, and the main question]
◦ Q1 (k = 0.91, alpha = 0.91)
◦ Q2 (k = 0.771, alpha = 0.771)
◦ Q3 (k = 0.033, alpha = 0.035)
Production example #2
[Task screenshot: an in-task captcha question, a de-branded tweet, and Q3 broken down by categories to get a better signal]
◦ Q1 (k = 0.907, alpha = 0.907)
◦ Q2 (k = 0.728, alpha = 0.728)
◦ Q3 breakdown by category:
  ◦ Worthless (alpha = 0.033)
  ◦ Trivial (alpha = 0.043)
  ◦ Funny (alpha = -0.016)
  ◦ Makes me curious (alpha = 0.026)
  ◦ Contains useful info (alpha = 0.048)
  ◦ Important news (alpha = 0.207)
Findings from designs
No quality control issues
Eliminating workers who did a poor job on question #1 didn’t affect inter-rater agreement for questions #2 and #3
Interestingness is a fully subjective notion
We can still build a classifier that identifies tweets that are interesting to a majority of users
Careful with That Axe Data, Eugene
In the area of big data and machine learning:
◦ labels -> features -> predictive model -> optimization
Labeling/experimentation is perceived as boring
Don’t rush labeling
◦ Human and machine
Label quality is very important
◦ Don’t outsource it
◦ Own it end to end
◦ Large scale
More on label quality
Data gathering is not a free lunch
You can’t outsource label acquisition and quality
Labels for the machine != labels for humans
Emphasis has been on algorithms, models/optimizations, and mining from labels
Not so much on algorithms for ensuring high-quality labels
Training sets
People are more than HPUs
Why is Facebook popular? People are social.
Information needs are contextually grounded in our social experiences and social networks
Our social networks also embody additional knowledge about us, our needs, and the world
We relate to recommendations
The social dimension complements computation
Opportunities in RecSys
Humans in the loop
Computation loops that mix humans and machines
A kind of active learning
Double goal:
◦ Human checking on the machine
◦ Machine checking on humans
Example: classifiers for social data
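A minimal Python sketch of such a loop for a social-data classifier. The model and crowd interfaces (predict, ask, retrain) and the 0.9 threshold are assumptions for illustration, not a real API:

```python
# Minimal sketch of a human/machine loop: the machine keeps what it is
# confident about, humans label the rest, and their labels retrain the model.

CONFIDENCE_THRESHOLD = 0.9  # illustrative

def label_items(model, items, crowd):
    auto_labeled, crowd_labeled = [], []
    for item in items:
        label, confidence = model.predict(item)   # assumed interface
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_labeled.append((item, label))
        else:
            human_label = crowd.ask(item)          # human checks on the machine
            crowd_labeled.append((item, human_label))
    model.retrain(crowd_labeled)                   # machine learns from humans
    # For the other direction, a sample of auto_labeled items can be
    # sent to humans periodically to audit the machine's labels.
    return auto_labeled, crowd_labeled
```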
Collaborative Filtering v2
Collaboration with recipients
Interactive
Learning new data
What’s in a label?
Clicks, reviews, ratings, etc.
Better or novel systems if we focus more on label quality?
New ways of collecting data
Training sets
Evaluation & measurement
Routing
Expertise detection and routing
Social load balancing
When to switch between machines and humans
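A minimal Python sketch of expertise-based routing with simple load balancing; the expertise scores, threshold, and data shapes are illustrative assumptions:

```python
# Minimal sketch: route a question to a qualified expert with the
# shortest queue; fall back to the machine if no expert qualifies.

def route(topic, workers, min_expertise=0.5):
    qualified = [w for w in workers
                 if w["expertise"].get(topic, 0.0) >= min_expertise]
    if not qualified:
        return None  # no human expert available: switch to the machine
    return min(qualified, key=lambda w: len(w["queue"]))  # load balancing

workers = [
    {"name": "ann", "expertise": {"travel": 0.9}, "queue": ["q1", "q2"]},
    {"name": "bob", "expertise": {"travel": 0.8}, "queue": []},
]
best = route("travel", workers)
print(best["name"])  # 'bob': both qualify, bob is less busy
```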
Conclusions
Crowdsourcing at scale works but requires a solid framework
Three aspects need attention: workers, work, and task design
Labeling social data is hard
Traditional IR approaches don’t seem to work for Twitter data
Label quality
Outlined areas where RecSys can benefit from crowdsourcing
Thank you
@elunca