Did you mean crowdsourcing for recommender systems?
OMAR ALONSO
6-OCT-2014
CROWDREC 2014
Disclaimer
The views and opinions expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
Outline
A bit on human computation
Crowdsourcing in information retrieval
Opportunities for recommender systems
Human Computation
Human Computation
You are a computer
Human-based computation
Use humans as processors in a distributed system
Address problems that computers aren’t good at
Games with a purpose
Examples
◦ ESP game
◦ CAPTCHA
◦ reCAPTCHA
Some definitions
Human computation is a computation that is performed by a human.
A human computation system is a system that organizes human efforts to carry out computation.
Crowdsourcing is a tool that a human computation system can use to distribute tasks.
Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
HC at the core of RecSys
“In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” – Resnick and Varian (CACM 1997)
S. Perugini, M. Gonçalves, E. Fox: Recommender Systems Research: A Connection-Centric Survey. J. Intell. Inf. Syst. 23(2): 107-143 (2004)
{where to go on vacation}
MTurk: 50 answers, $1.80
Quora: 2 answers
Y! Answers: 2 answers
FB: 1 answer
Tons of results
Read title + snippet + URL
Explore a few pages in detail
{where to go on vacation}
[Crowd answers grouped into two lists: countries and cities]
Information Retrieval and Crowdsourcing
The rise of crowdsourcing in IR
Crowdsourcing is hot
Lots of interest in the research community
◦ Articles showing good results
◦ Journal special issues (IR, IEEE Internet Computing, etc.)
◦ Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, VLDB, RecSys, CHI, etc.)
◦ HCOMP
◦ CrowdConf
Large companies using crowdsourcing
Big data
Start-ups
Venture capital investment
Why is this interesting?
Easy to prototype and test new experiments
Cheap and fast
No need to set up infrastructure
Introduce experimentation early in the cycle
In the context of IR, implement and experiment as you go
For new ideas, this is very helpful
Caveats and clarifications
Trust and reliability
The wisdom of the crowd, revisited
Adjust expectations
Crowdsourcing is another data point for your analysis
Complementary to other experiments
Why now?
The Web
Use humans as processors in a distributed system
Address problems that computers aren’t good at
Scale
Reach
Motivating example: relevance judging
Relevance of search results is difficult to judge
◦ Highly subjective
◦ Expensive to measure
Professional editors commonly used
Potential benefits of crowdsourcing
◦ Scalability (time and cost)
◦ Diversity of judgments
Matt Lease and Omar Alonso. “Crowdsourcing for search evaluation and social-algorithmic search”, ACM SIGIR 2012 Tutorial.
Crowdsourcing and relevance evaluation
For relevance, it combines two main approaches
◦ Explicit judgments
◦ Automated metrics
Other features
◦ Large scale
◦ Inexpensive
◦ Diversity
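In this combination the explicit crowd judgments typically become the ground truth that an automated metric consumes. A minimal Python sketch of that hand-off, computing precision at k from aggregated crowd labels; all names and values are illustrative, not from the talk:

```python
# Minimal sketch: feed aggregated crowd relevance judgments into an
# automated metric (precision at k). All names here are illustrative.

def precision_at_k(ranked_doc_ids, judgments, k):
    """judgments maps doc_id -> True/False (the aggregated crowd label)."""
    top_k = ranked_doc_ids[:k]
    relevant = sum(1 for doc in top_k if judgments.get(doc, False))
    return relevant / k

# Aggregated crowd labels for one query's results.
crowd_judgments = {"d1": True, "d2": False, "d3": True}
print(precision_at_k(["d1", "d2", "d3"], crowd_judgments, k=3))  # ~0.67
```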
Development framework
Incremental approach
Measure, evaluate, and adjust as you go
Suitable for repeatable tasks
O. Alonso. “Implementing crowdsourcing-based relevance experimentation: an industrial perspective”. Information Retrieval, 16(2), 2013.
Asking questions
Ask the right questions
Part art, part science
Instructions are key
Workers may not be IR experts, so don’t assume they share your understanding of the terminology
Show examples
Hire a technical writer
◦ The engineer writes the specification
◦ The writer communicates it
N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
UX design
Time to apply all those usability concepts
Experiment should be self-contained.
Keep it short and simple: brief and concise.
Be very clear with the relevance task.
Engage with the worker. Avoid boring stuff.
Document presentation & design
Need to grab attention
Always ask for feedback (open-ended question) in an input box.
Localization
Other design principles
Text alignment
Legibility
Reading level: complexity of words and sentences
Attractiveness (worker’s attention & enjoyment)
Multi-cultural / multi-lingual
Who is the audience (e.g. target worker community)
Special needs communities (e.g. simple color blindness)
Cognitive load: mental rigor needed to perform task
Exposure effect
When to assess work quality?
Beforehand (prior to main task activity)
◦ How: “qualification tests” or similar mechanism
◦ Purpose: screening, selection, recruiting, training
During
◦ How: assess labels as the worker produces them
◦ Like random checks on a manufacturing line
◦ Purpose: calibrate, reward/penalize, weight
After
◦ How: compute accuracy metrics post hoc
◦ Purpose: filter, calibrate, weight, retain (HR)
How do we measure work quality?
Compare worker’s label vs.
◦ Known (correct, trusted) label
◦ Other workers’ labels
◦ Model predictions of workers and labels
Verify worker’s label
◦ Yourself
◦ Tiered approach (e.g. Find-Fix-Verify)
Comparing to known answers
AKA: gold, honey pot, verifiable answer, trap
Assumes you have known answers
Cost vs. benefit
◦ Producing known answers (experts?)
◦ % of work spent re-producing them
Finer points
◦ What if workers recognize the honey pots?
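As a concrete illustration, a minimal Python sketch of honey-pot scoring: grade each worker on the items whose correct label is known, and flag low scorers. The field names and the 0.8 threshold are assumptions for the example, not values from the talk:

```python
# Minimal sketch of honey-pot (gold) scoring.

def gold_accuracy(worker_labels, gold_labels):
    """Both arguments map item_id -> label; returns accuracy on shared items."""
    shared = [item for item in worker_labels if item in gold_labels]
    if not shared:
        return None  # the worker has not seen any honey pots yet
    correct = sum(worker_labels[i] == gold_labels[i] for i in shared)
    return correct / len(shared)

gold = {"doc7": "relevant", "doc42": "not relevant"}
worker = {"doc7": "relevant", "doc42": "relevant", "doc99": "relevant"}
accuracy = gold_accuracy(worker, gold)        # 0.5
if accuracy is not None and accuracy < 0.8:   # illustrative threshold
    print("flag this worker for review")
```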
Comparing to other workers
AKA: consensus, plurality, redundant labeling
Well-known metrics for measuring agreement
Cost vs. benefit: % of work that is redundant
Finer points
◦ Is consensus “truth” or the systematic bias of a group?
◦ What if no one really knows what they’re doing?
◦ Low agreement across workers indicates the problem is with the task (or a specific example), not the workers
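A minimal Python sketch of the consensus approach, aggregating redundant labels by majority vote; the tie handling follows the expert-adjudication idea discussed under agreement methods below:

```python
# Minimal sketch of aggregating redundant labels by majority vote.
from collections import Counter

def majority_vote(labels):
    """Return the winning label, or None on a tie (route to an expert)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: break with an expert or collect more judgments
    return counts[0][0]

print(majority_vote(["rel", "rel", "not rel", "rel", "not rel"]))  # rel
print(majority_vote(["rel", "not rel"]))  # None -> needs adjudication
```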
Methods for measuring agreement
What to look for
◦ Agreement, reliability, validity
Inter-agreement level
◦ Agreement between judges
◦ Agreement between judges and the gold set
Some statistics
◦ Percentage agreement
◦ Cohen’s kappa (2 raters)
◦ Fleiss’ kappa (any number of raters)
◦ Krippendorff’s alpha
With majority vote, what if 2 say relevant and 3 say not?
◦ Use an expert to break ties
◦ Collect more judgments as needed to reduce uncertainty
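For two raters, here is a minimal self-contained Python sketch of the first two statistics, percentage agreement and Cohen’s kappa; the example labels are illustrative:

```python
# Minimal sketch: percentage agreement and Cohen's kappa for two raters.
from collections import Counter

def percentage_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters chose the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percentage_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters independently pick the same label.
    p_chance = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    return (p_observed - p_chance) / (1 - p_chance)

a = [1, 1, 0, 1, 0]  # rater A: relevant (1) / not relevant (0)
b = [1, 0, 0, 1, 0]  # rater B
print(percentage_agreement(a, b))  # 0.8
print(cohens_kappa(a, b))          # ~0.615
```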
Pause
Crowdsourcing works
◦ Fast turnaround, easy to experiment, a few dollars to test
◦ But: you have to design experiments carefully, manage quality, and work within platform limitations
Crowdsourcing in production
◦ Large-scale data sets (millions of labels)
◦ Continuous execution
◦ Difficult to debug
Multiple contingent factors
How do you know the experiment is working?
Goal: a framework for ensuring the reliability of crowdsourcing tasks
O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.
Labeling tweets – an example of a task
Is this tweet interesting?
Subjective activity
Not focused on specific events
Findings
◦ Difficult problem, low inter-rater agreement (Fleiss’ kappa, Krippendorff’s alpha)
◦ Tested many designs, numbers of workers, and platforms (MTurk and others)
Multiple contingent factors
◦ Worker performance
◦ Work
◦ Task design
O. Alonso, C. Marshall and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
Designs that include an in-task CAPTCHA
Borrowed the idea from reCAPTCHA -> use of a control term
Adapt your labeling task
2 more questions as controls
◦ 1 algorithmic
◦ 1 semantic
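A minimal Python sketch of how such control questions might be used to filter responses before analyzing the main label; the field names and expected answers are assumptions for illustration:

```python
# Minimal sketch: keep only responses that pass both control questions
# (one algorithmic, one semantic) before analyzing the main label.

def passes_controls(response, expected):
    return (response["algorithmic_answer"] == expected["algorithmic"]
            and response["semantic_answer"] == expected["semantic"])

expected = {"algorithmic": "12", "semantic": "weather"}
responses = [
    {"worker": "w1", "label": "interesting",
     "algorithmic_answer": "12", "semantic_answer": "weather"},
    {"worker": "w2", "label": "interesting",
     "algorithmic_answer": "12", "semantic_answer": "sports"},
]
kept = [r for r in responses if passes_controls(r, expected)]
print([r["worker"] for r in kept])  # only 'w1' survives
```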
Production example #1
[Task screenshot: an in-task captcha question, a de-branded tweet, and the main question]
◦ Q1 (k = 0.91, alpha = 0.91)
◦ Q2 (k = 0.771, alpha = 0.771)
◦ Q3 (k = 0.033, alpha = 0.035)
Production example #2
[Task screenshot: an in-task captcha question, a de-branded tweet, and Q3 broken down by categories to get a better signal]
◦ Q1 (k = 0.907, alpha = 0.907)
◦ Q2 (k = 0.728, alpha = 0.728)
◦ Q3 breakdown by category:
  ◦ Worthless (alpha = 0.033)
  ◦ Trivial (alpha = 0.043)
  ◦ Funny (alpha = -0.016)
  ◦ Makes me curious (alpha = 0.026)
  ◦ Contains useful info (alpha = 0.048)
  ◦ Important news (alpha = 0.207)
Findings from designs
No quality control issues
Eliminating workers who did a poor job on question #1 didn’t affect inter-rater agreement for questions #2 and #3
Interestingness is a fully subjective notion
We can still build a classifier that identifies tweets that are interesting to a majority of users
Careful with That Axe Data, Eugene
In the area of big data and machine learning:
◦ labels -> features -> predictive model -> optimization
Labeling/experimentation is perceived as boring
Don’t rush labeling
◦ Human and machine
Label quality is very important
◦ Don’t outsource it
◦ Own it end to end
◦ Large scale
More on label quality
Data gathering is not a free lunch
You can’t outsource label acquisition and quality
Labels for the machine != labels for humans
Emphasis has been on algorithms, models/optimizations, and mining from labels
Not so much on algorithms for ensuring high-quality labels
Training sets
People are more than HPUs
Why is Facebook popular? People are social.
Information needs are contextually grounded in our social experiences and social networks
Our social networks also embody additional knowledge about us, our needs, and the world
We relate to recommendations
The social dimension complements computation
Opportunities in RecSys
Humans in the loop
Computation loops that mix humans and machines
A kind of active learning
Double goal:
◦ Human checking on the machine
◦ Machine checking on humans
Example: classifiers for social data
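A minimal Python sketch of such a loop for a social-data classifier. The model and crowd interfaces (predict, ask, retrain) and the 0.9 threshold are assumptions for illustration, not a real API:

```python
# Minimal sketch of a human/machine loop: the machine keeps what it is
# confident about, humans label the rest, and their labels retrain the model.

CONFIDENCE_THRESHOLD = 0.9  # illustrative

def label_items(model, items, crowd):
    auto_labeled, crowd_labeled = [], []
    for item in items:
        label, confidence = model.predict(item)   # assumed interface
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_labeled.append((item, label))
        else:
            human_label = crowd.ask(item)          # human checks on the machine
            crowd_labeled.append((item, human_label))
    model.retrain(crowd_labeled)                   # machine learns from humans
    # For the other direction, a sample of auto_labeled items can be
    # sent to humans periodically to audit the machine's labels.
    return auto_labeled, crowd_labeled
```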
Collaborative Filtering v2
Collaboration with recipients
Interactive
Learning new data
What’s in a label?
Clicks, reviews, ratings, etc.
Better or novel systems if we focus more on label quality?
New ways of collecting data
Training sets
Evaluation & measurement
Routing
Expertise detection and routing
Social load balancing
When to switch between machines and humans
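A minimal Python sketch of expertise-based routing with simple load balancing; the expertise scores, threshold, and data shapes are illustrative assumptions:

```python
# Minimal sketch: route a question to a qualified expert with the
# shortest queue; fall back to the machine if no expert qualifies.

def route(topic, workers, min_expertise=0.5):
    qualified = [w for w in workers
                 if w["expertise"].get(topic, 0.0) >= min_expertise]
    if not qualified:
        return None  # no human expert available: switch to the machine
    return min(qualified, key=lambda w: len(w["queue"]))  # load balancing

workers = [
    {"name": "ann", "expertise": {"travel": 0.9}, "queue": ["q1", "q2"]},
    {"name": "bob", "expertise": {"travel": 0.8}, "queue": []},
]
best = route("travel", workers)
print(best["name"])  # 'bob': both qualify, bob is less busy
```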
Conclusions
Crowdsourcing at scale works but requires a solid framework
Three aspects need attention: workers, work, and task design
Labeling social data is hard
Traditional IR approaches don’t seem to work for Twitter data
Label quality
Outlined areas where RecSys can benefit from crowdsourcing
Thank you
@elunca