Crowdsourcing using Mechanical Turk for Human Computer Interaction Research
Ed H. Chi
Research Scientist, Google (work done while at Xerox PARC)
1
Historical Footnote
• De Prony, 1794, hired hairdressers (unemployed after the French Revolution; knew only addition and subtraction) to create logarithmic and trigonometric tables.
• He managed the process by splitting the work into very detailed workflows.
– Grier, When computers were human, 2005
2
Human Computation, Round 1
• Humans were the first “computers”, used for math computations
• Example: Clairaut, astronomy, 1758: computed Halley’s comet orbit (three-body problem) by dividing the numeric computations across astronomers
– Grier, When Computers Were Human, 2005; Grier, IEEE Annals 1998
Talk in 3 Acts
• Act 1: – How we almost failed in using MTurk?! – [Kittur, Chi, Suh, CHI2008]
• Act II: – Apply MTurk to visualization evaluation – [Kittur, Suh, Chi, CSCW2008]
• Act III: – Where are the limits?
3
Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In CHI2008. Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In CSCW2008.
Example Task from Amazon MTurk
4
Using Mechanical Turk for user studies
– Task complexity: complex, long (traditional user studies) vs. simple, short (Mechanical Turk)
– Task subjectivity: subjective, opinions (traditional) vs. objective, verifiable (Mechanical Turk)
– User information: targeted demographics, high interactivity (traditional) vs. unknown demographics, limited interactivity (Mechanical Turk)
Can Mechanical Turk be usefully used for user studies?
5
Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., “Germany”, “Noam Chomsky”) – 7-point scale
• Can we get matching ratings with Mechanical Turk?
6
Experiment 1
• Rate articles on 7-point scales: – Well written
– Factually accurate
– Overall quality
• Free-text input: – What improvements does the article need?
• Paid $0.05 each
7
Experiment 1: Good news
• 58 users made 210 ratings (15 per article) – $10.50 total
• Fast results – 44% within a day, 100% within two days
– Many completed within minutes
8
Experiment 1: Bad news
• Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
• Nearly 75% of these done by only 8 users
Experiment 1:
– Invalid comments: 49%
– <1 min responses: 31%
9
Not a good start
• Summary of Experiment 1:
– Only marginal correlation with experts
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
10
Design changes
• Use verifiable questions to signal monitoring – “How many sections does the article have?”
– “How many images does the article have?”
– “How many references does the article have?”
11
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
– “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
12
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing task
– Used tasks similar to how Wikipedians evaluate quality (organization, presentation, references)
13
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing task
• Put verifiable tasks before subjective responses – First do objective tasks and summarization – Only then evaluate subjective quality
– Ecological validity?
14
Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians
– r=.66, p=.01
• Smaller proportion of malicious responses
• Increased time on task
Experiment 1 vs. Experiment 2:
– Invalid comments: 49% → 3%
– <1 min responses: 31% → 7%
– Median time: 1:30 → 4:06
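The agreement statistics above can be checked with a few lines of code. The two rating vectors below are illustrative placeholders (per-article expert means vs. mean turker ratings), not the study's actual data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-article means: expert (Wikipedian) vs. mean turker rating
expert = [6.5, 5.0, 3.5, 2.0, 6.0]
turker = [6.0, 5.5, 4.0, 2.5, 5.5]
print(round(pearson_r(expert, turker), 3))  # → 0.975
```

With real data, a significance test (e.g., a t-test on r with n-2 degrees of freedom) would accompany the coefficient, as in the p-values reported above.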
15
Quick Summary of Tips
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses
• Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
• Good results require careful task design
16
Generalizing to other MTurk studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about content/design of prototype before subjective evaluation
– User surveys: ask common-knowledge questions before asking for opinions
• Filtering for quality
– Put in a field for free-form responses and filter out data without answers
– Filter out results that came in too quickly
– Sort by WorkerID and look for cut-and-paste answers
– Look for suspicious outliers in the data
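The filtering heuristics above can be sketched as a post-hoc pass over downloaded results. The record fields here (worker, seconds, answer) are hypothetical stand-ins for columns in an MTurk results file.

```python
from collections import Counter

def filter_results(rows, min_seconds=60):
    """Flag suspicious MTurk submissions: empty free-text answers,
    too-fast completions, and verbatim cut-and-paste duplicates."""
    # Count identical free-text answers across all submissions
    answer_counts = Counter(r["answer"].strip().lower() for r in rows)
    kept, flagged = [], []
    for r in rows:
        text = r["answer"].strip()
        if not text:
            flagged.append((r, "empty free-text answer"))
        elif r["seconds"] < min_seconds:
            flagged.append((r, "completed too quickly"))
        elif answer_counts[text.lower()] > 1:
            flagged.append((r, "duplicate answer (possible cut-and-paste)"))
        else:
            kept.append(r)
    return kept, flagged

rows = [
    {"worker": "W1", "seconds": 240, "answer": "Needs more references."},
    {"worker": "W2", "seconds": 12,  "answer": "good article"},
    {"worker": "W3", "seconds": 180, "answer": ""},
    {"worker": "W4", "seconds": 200, "answer": "Nice."},
    {"worker": "W4", "seconds": 205, "answer": "Nice."},
]
kept, flagged = filter_results(rows)
print(len(kept), len(flagged))  # → 1 4
```

The thresholds are study-specific judgment calls; in the experiments above, sub-minute responses were the red flag.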
17
Talk in 3 Acts
• Act 1: – How we almost failed?!
• Act II: – Applying MTurk to visualization evaluation
• Act III: – Where are the limits?
18
What would make you trust Wikipedia more?
20
What is Wikipedia?
“Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you’re getting the
best possible information.” – Steve Carell, The Office
21
What would make you trust Wikipedia more?
Nothing
22
What would make you trust Wikipedia more?
“Wikipedia, just by its nature, is impossible to trust completely. I don't think this can necessarily be changed.”
23
WikiDashboard
• Transparency of social dynamics can reduce conflict and coordination issues
• Attribution encourages contribution
– WikiDashboard: social dashboard for wikis
– Prototype system: http://wikidashboard.parc.com
Visualization for every wiki page showing edit history timeline and top individual editors
Can drill down into activity history for specific editors and view edits to see changes side-by-side
24
Citation: Suh et al. CHI 2008 Proceedings
2011 UCBerkeley Visual Computing Retreat
Hillary Clinton
25
Top Editor: Wasted Time R
26
Surfacing information
• Numerous studies mining Wikipedia revision history to surface trust-relevant information – Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007;
Viegas et al., 2004; Zeng et al., 2006
• But how much impact can this have on user perceptions in a system which is inherently mutable?
Suh, Chi, Kittur, & Pendleton, CHI2008
27
Hypotheses
1. Visualization will impact perceptions of trust
2. Compared to baseline, visualization will impact trust both positively and negatively
3. Visualization should have most impact when there is high uncertainty about the article
• Low quality
• High controversy
28
Design
• 3 x 2 x 2 design
• Articles (2 quality x 2 controversy, 2 articles per cell):
– High quality, controversial: Abortion; George Bush
– High quality, uncontroversial: Volcano; Shark
– Low quality, controversial: Pro-life feminism; Scientology and celebrities
– Low quality, uncontroversial: Disk defragmenter; Beeswax
• Visualization:
– High stability
– Low stability
– Baseline (none)
29
Example: High trust visualization
30
Example: Low trust visualization
31
Summary info
• % from anonymous users
32
Summary info
• % from anonymous users
• Last change by anonymous or established user
33
Summary info
• % from anonymous users
• Last change by anonymous or established user
• Stability of words
34
Graph
• Instability
35
Method
• Users recruited via Amazon’s Mechanical Turk
– 253 participants, 673 ratings, 7 cents per rating
– Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
• To ensure salience and valid answers, participants answered:
– In what time period was this article the least stable?
– How stable has this article been for the last month?
– Who was the last editor?
– How trustworthy do you consider the above editor?
36
Results
[Chart: trustworthiness rating (1-7) by quality (low/high) and controversy (uncontroversial/controversial), for the High stability, Baseline, and Low stability visualization conditions]
Main effects of quality and controversy:
• high-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
• uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
37
Results
[Chart: trustworthiness rating (1-7) by quality (low/high) and controversy (uncontroversial/controversial), for the High stability, Baseline, and Low stability visualization conditions]
Interaction effects of quality and controversy:
• high-quality articles were rated equally trustworthy whether controversial or not, while
• low-quality articles were rated lower when they were controversial than when they were uncontroversial.
38
Results
1. Significant effect of visualization: High-Stability > Low-Stability, p < .001
2. Viz has both positive and negative effects:
– High-Stability > Baseline (p < .001) > Low-Stability (p < .01)
3. No interaction of visualization with either quality or controversy
– Robust across visualization conditions
[Chart: trustworthiness rating (1-7) by quality (low/high) and controversy (uncontroversial/controversial), for the High stability, Baseline, and Low stability visualization conditions]
39
Talk in 3 Acts
• Act 1: – How we almost failed?!
• Act II: – Applying MTurk to visualization evaluation
• Act III: – Where are the limits?
42
Limitations of Mechanical Turk
• No control of users’ environment
– Potential for different browsers, physical distractions
– General problem with online experimentation
• Not yet designed for user studies
– Difficult to do between-subjects design
– May need some programming
• Hard to control user population
– Hard to control demographics, expertise
43
Crowdsourcing for HCI Research
• Does my interface/visualization work?
– WikiDashboard: transparency vis for Wikipedia [Suh et al. VAST, Kittur et al. CSCW2008]
– Replicating Perceptual Experiments [Heer et al., CHI2010]
• Coding of large amounts of user data
– What is a Question in Twitter? [S. Paul, L. Hong, E. Chi, ICWSM 2011]
• Incentive mechanisms
– Intrinsic vs. extrinsic rewards: games vs. pay
– [Horton & Chilton, 2010 on MTurk] and satisficing; [Ariely, 2009] in general: higher pay != better work
45
Managing Quality
• Quality through redundancy: combining votes
– Majority vote [works best when worker quality is similar]
– Worker-quality-adjusted vote
– Managing dependencies
• Quality through gold data
– Advantageous with imbalanced datasets & bad workers
• Estimating worker quality (redundancy + gold)
– Calculate the confusion matrix and see if you actually get some information from the worker
• Toolkit: http://code.google.com/p/get-another-label/
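A minimal sketch of the redundancy and gold-data ideas above, using a toy answer log of (worker, item, label) triples; all IDs and labels are invented. For worker-quality-adjusted voting and full confusion matrices, the get-another-label toolkit is the production-grade reference.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Combine redundant labels for one item by simple plurality."""
    return Counter(labels).most_common(1)[0][0]

def worker_accuracy(answers, gold):
    """Estimate each worker's quality from gold (known-answer) items."""
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    return {w: hits[w] / seen[w] for w in seen}

answers = [
    ("W1", "q1", "spam"), ("W2", "q1", "spam"), ("W3", "q1", "ham"),
    ("W1", "g1", "spam"), ("W2", "g1", "spam"), ("W3", "g1", "ham"),
]
gold = {"g1": "spam"}

by_item = defaultdict(list)
for worker, item, label in answers:
    if item not in gold:
        by_item[item].append(label)

print({item: majority_vote(ls) for item, ls in by_item.items()})  # → {'q1': 'spam'}
print(worker_accuracy(answers, gold))  # W3 scores 0.0 on the gold item
```

Low-accuracy workers flagged this way can then be down-weighted or excluded before the vote is taken.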
46 Source: Ipeirotis, WWW2011
Coding and Machine Learning
• Integration with Machine Learning – Build automatic classification models using
crowdsourced data
47
Simple Solution
• Obtain labeled training data
• Use training data to build model
[Diagram: data from existing crowdsourced answers trains an automatic model (through machine learning); each new case is fed to the model, which produces an automatic answer]
Source: Ipeirotis, WWW2011
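A toy end-to-end version of this pipeline, assuming crowd labels have already been aggregated (e.g., by majority vote): fit a small Naive Bayes text classifier on crowd-labeled examples, then answer a new case automatically. All example texts and labels are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(examples):
    """Fit per-class word counts from crowd-labeled (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, cc in class_counts.items():
        total = sum(word_counts[label].values())
        score = log(cc / sum(class_counts.values()))
        for w in text.lower().split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Crowd-labeled tweets (labels assumed aggregated across workers)
examples = [
    ("any good ipad app recommendations", "question"),
    ("how do i lose weight fast", "question"),
    ("great game last night", "statement"),
    ("i love this song", "statement"),
]
wc, cc = train_nb(examples)
print(classify("any good song recommendations", wc, cc))  # → question
```

Once trained, the model answers new cases without further crowd cost; crowd effort is reserved for labeling fresh training data.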
Crowd Programming for Complex Tasks
• Decompose tasks into smaller tasks – Digital Taylorism
– Frederick Winslow Taylor (1856-1915)
– 1911 'Principles Of Scientific Management’
• Crowd Programming Explorations – MapReduce Models
• Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge. • Kulkarni, Can, Hartmann, CHI2011 workshop & WIP
– Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In KDD 2010 Workshop on Human Computation.
48
Crowd Programming for Complex Tasks
• Crowd Programming Explorations
– Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.
– Kulkarni, Can, Hartmann, CHI2011 workshop & WIP
49
“Please solve the 16-question SAT located at
http://bit.ly/SATexam”. In both cases, we paid workers
between $0.10 and $0.40 per HIT. Each “subdivide” or
“merge” HIT received answers within 4 hours; solutions
to the initial task were complete within 72 hours.
Results
The decompositions produced by Turkers while running
Turkomatic are displayed in Figure 1 (essay-writing)
and Figure 4 (SAT).
In the essay task, each “subdivide” HIT was posted
three times by Turkomatic and the best of the three
was selected by experimenters (simulating Turker
voting) to continue the solution process. The proposed
decompositions were overwhelmingly linear and chose
to break the task down either by paragraph or by
activity (for example, one Turker proposed: brainstorm,
create outline, write topic sentences, fill in facts). The
decomposition used in the final essay used two levels of
recursion. As groups of subtasks were completed,
Turkomatic passed solutions to merge workers for
reassembly. The resulting essay is complete and
coherent, although somewhat lacking in cohesion.
We allowed essay-writers to pick a topic; the chosen
one (university legacy admissions) was somewhat
specialized, but the final essay displayed a reasonably
good understanding of the topic, even if the writing
quality was often mixed. The decomposition selected
for the SAT task used only one level of recursion.
Workers divided the task into 12 subtasks consisting of
1 to 3 thematically linked questions. These were each
solved in parallel by distinct workers and the results
were given to a merge worker who produced the final
solution. The score on the overall solution was 12/17,
with the worst performance on math and grammar
questions and the best in reading and vocabulary.
Obtaining useful decompositions proved tricky for
workers – many seemed confused about the nature of
the planning task. However, once the tasks were
decomposed, solution of the constituent parts and
reassembly into an overall solution were
straightforward for Turkers to accomplish.
Evaluation: Interface
In a second informal study, we examined whether
reducing user involvement in the HIT design improved
ease of use and efficiency. We hypothesized that the
high level of abstraction enabled by automatic task
design would make it easier for requesters to
crowdsource their work.
We asked a pool of four users to try to collect answers
for a basic brainstorming task on Mechanical Turk. The
task asked our participants to generate five ideas of
topics for an essay. Participants performed this task
twice, first, using Turkomatic to post tasks and obtain
results, then, using Mechanical Turk’s web interface. No
instruction on either interface was provided. We
examined how long it took the user to post the task.
With Turkomatic, our users finished posting their tasks
in an average of 37 seconds. On Mechanical Turk,
where low-level task design was required, users needed
an average of 244.2 seconds to post their tasks. More
importantly, the HITs posted by two users who were
not familiar with Mechanical Turk would not have
produced any meaningful results. One user posted
minor variations of the default templates provided on
Figure 4. For the SAT task, we uploaded
sixteen questions from a high school
Scholastic Aptitude Test to the web and
posed the following task to Turkomatic:
“Please solve the 16-question SAT located
at http://bit.ly/SATexam”.
In map tasks, a specified processing step is applied to each item in the partition. These tasks are ideally simple enough to be answerable by a single worker in a short amount of time. For example, a map task for article writing could ask a worker to collect one fact on a given topic in the article’s outline. Multiple instances of map tasks could be instantiated for each partition; e.g., multiple workers could be asked to collect one fact each on a topic in parallel.
Finally, reduce tasks take all the results from a given map task and consolidate them, typically into a single result. In the article writing example, a reduce step might take facts collected for a given topic by many workers and have a worker turn them into a paragraph.
Any of these steps can be iterative. For example, the topic for an article section defined in a first partition can itself be partitioned into subsections. Similarly, the paragraphs returned from one reduction step can in turn be reordered through a second reduction step.
Case study
We explored as a case study the complex task of writing an encyclopedia article. Writing an article is a challenging and interdependent task that involves many different subtasks: planning the scope of the article, how it should be structured, finding and filtering information to include, writing up that information, finding and fixing grammar and spelling, and making the article coherent. These characteristics make article writing a challenging but representative test case for our approach.
To solve this problem we created a simple flow consisting of a partition, map, and reduce step. The partition step asked workers to create an article outline, represented as an array of section headings such as “History” and “Geography”. In an environment where workers would complete high effort tasks, the next step might be to have someone write a paragraph for each section. However, the difficulty and time involved in finding the information for and writing a complete paragraph for a heading is a mismatch to the low work capacity of micro-task markets. Thus we broke the task up further, separating the information collection and writing subtasks. Specifically, each section heading from the partition was used to generate map tasks in
Figure: Partial results of a collaborative writing task.
CHI 2011 • Work-in-Progress May 7–12, 2011 • Vancouver, BC, Canada
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
50
Figure 2: Six questions asked by participants, the photographs they took, and answers received with latency in seconds.
– “What color is this pillow?” → (89s) . (105s) multiple shades of soft green, blue and gold
– “What denomination is this bill?” → (24s) 20 (29s) 20
– “Do you see picnic tables across the parking lot?” → (13s) no (46s) no
– “What temperature is my oven set to?” → (69s) it looks like 425 degrees but the image is difficult to see. (84s) 400 (122s) 450
– “Can you please tell me what this can is?” → (183s) chickpeas. (514s) beans (552s) Goya Beans
– “What kind of drink does this can hold?” → (91s) Energy (99s) no can in the picture (247s) energy drink
the total time required to answer a question. quikTurkit also makes it easy to keep a pool of workers of a given size continuously engaged and waiting, although workers must be paid to wait. In practice, we have found that keeping 10 or more workers in the pool is doable, although costly.
Most Mechanical Turk workers find HITs to do using the provided search engine (available at mturk.com). This search engine allows users to view available HITs sorted by creation date, the number of HITs available, the reward amount, the expiration date, the title, or the time allotted for the work. quikTurkit employs several heuristics for optimizing its listing in order to obtain workers quickly. First, it posts many more HITs than are actually required at any time because only a fraction will actually be picked up within the first few minutes. These HITs are posted in batches, helping quikTurkit HITs stay near the top. Finally, quikTurkit supports posting multiple HIT variants at once with different titles or reward amounts to cover more of the first page of search results.
VizWiz currently posts a maximum of 64 times more HITs than are required, posts them at a maximum rate of 4 HITs every 10 seconds, and uses 6 different HIT variants (2 titles × 3 rewards). These choices are explored more closely in the context of VizWiz in the following section.
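The posting heuristics described here (over-posting, rate-limited batches, cycling title × reward variants) can be sketched as a simple scheduler. `posting_plan` and its defaults are assumptions modeled on the numbers in this excerpt, not quikTurkit's actual API.

```python
from itertools import product

def posting_plan(needed, multiplier=64, batch_size=4, interval_s=10):
    """Schedule HIT postings: over-post by `multiplier`, in rate-limited
    batches, cycling through title x reward variants (quikTurkit-style)."""
    variants = list(product(
        ["3 Quick Visual Questions",
         "Answer Three Questions for A Blind Person"],
        [0.01, 0.02, 0.03],
    ))
    total = needed * multiplier
    plan = []
    for i in range(total):
        t = (i // batch_size) * interval_s  # seconds after start
        title, reward = variants[i % len(variants)]
        plan.append((t, title, reward))
    return plan

plan = posting_plan(needed=1)
print(len(plan), plan[-1][0])  # → 64 150 (64 HITs, last batch at 150 s)
```

Each scheduled entry would then be submitted via the MTurk API at its offset time; the variant cycling spreads postings across the first page of search results.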
FIELD DEPLOYMENT
To better understand how VizWiz might be used by blind people in their everyday lives, we deployed it to 11 blind iPhone users aged 22 to 55 (3 female). Participants were recruited remotely and guided through using VizWiz over the phone until they felt comfortable using it. The wizard interface used by VizWiz speaks instructions as it goes, and so participants generally felt comfortable using VizWiz after a single use. Participants were asked to use VizWiz at least once a day for one week. After each answer was returned, participants were prompted to leave a spoken comment.
quikTurkit used the following two titles for the jobs that it posted to Mechanical Turk: “3 Quick Visual Questions” and “Answer Three Questions for A Blind Person.” The reward distribution was set such that half of the HITs posted paid $0.01, and a quarter paid $0.02 and $0.03 each.
Asking Questions Participants asked a total of 82 questions (see Figure 2 for participant examples and accompanying photographs). Speech recognition correctly recognized the question asked for only 13 of the 82 questions (15.8%), and 55 (67.1%) questions could be answered from the photos taken. Of the 82 questions, 22 concerned color identification, 14 were open ended “what is this?” or “describe this picture” questions, 13 were of the form “what kind of (blank) is this?,” 12 asked for text to be read, 12 asked whether a particular object was contained within the photograph, 5 asked for a numerical answer or currency denomination, and 4 did not fit into these categories.
Problems Taking Pictures 9 (11.0%) of the images taken were too dark for the question to be answered, and 17 (21.0%) were too blurry for the question to be answered. Although a few other questions could not be answered due to the photos that were taken, photos that were too dark or too blurry were the most prevalent reason why questions could not be answered. In the next section, we discuss a second iteration on the VizWiz prototype that helps to alert users to these particular problems before sending the questions to workers.
Answers Overall, the first answer received was correct in 71 of 82 cases (86.6%), where “correct” was defined as either being the answer to the question or an accurate description of why the worker could not answer the question with the information contained within the photo provided (i.e., “This image is too blurry”). A correct answer was received in all cases by the third answer.
The first answer was received across all questions in an average of 133.3 seconds (SD=132.7), although the latency required varied dramatically based on whether the question could actually be answered from the picture and on whether the speech recognition accurately recognized the question (Figure 4). Workers took 105.5 seconds (SD=160.3) on average to answer questions that could be answered by the provided photo compared to 170.2 seconds (SD=159.5) for those
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
51
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
• Shepherding Crowdwork – Dow et al. CHI2011 WIP
52
workers to persevere and accept additional tasks. We investigate these hypotheses through a prototype system, Shepherd, that demonstrates how to make feedback an integral part of crowdsourced creative work.
Understanding Opportunities for Crowd Feedback
To effectively design feedback mechanisms that achieve the goals of learning, engagement, and quality improvement, we first analyze the important dimensions of the design space for crowd feedback (Figure 2).
Timeliness: When should feedback be shown?
In micro-task work, workers stay with tasks for a short while, then move on. This implies two timing options: synchronously deliver feedback when workers are still engaged in a set of tasks, or asynchronously deliver feedback after workers have completed the tasks.
Synchronous feedback may have more impact on future task performance since it arrives while workers are still thinking about the task domain. It also increases the probability that workers will continue onto similar tasks. However, synchronous feedback places a burden on the feedback providers; they have little time to review work. This implies a need for tools or scheduling algorithms that enable near real-time feedback. Asynchronous feedback gives feedback providers more time to review and comment on work.
However, workers may have forgotten about the task or feel unmotivated to review the feedback and to return to the task.
Currently, platforms like Mechanical Turk only allow asynchronous feedback with no enticement to return. Requesters can provide feedback at payment time, but at that point (typically days later), workers care more about getting paid than improving submitted work. More importantly, unless requesters have more jobs available, workers cannot act on requesters’ advice.
Specificity: How detailed should feedback be?
Mechanical Turk currently allows requesters one bit of feedback: accept or reject. While additional freeform communication is possible, it is rarely used unless workers file complaints. Workers may learn more if they receive detailed and personalized feedback on each piece of work. However, this added specificity comes at a price: feedback providers must spend time authoring feedback. When feedback resources are limited, customizable templates can accelerate feedback generation and enable requesters to codify domain knowledge into pre-authored statements. However, templates could be perceived as overly general or repetitive, reducing their desired impact. Workers may need explicit incentive to read and reflect on feedback.
Source: Who should provide feedback?
Crowdsourcing requesters post tasks with specific quality objectives in mind; they are a natural choice for assuming the feedback role. However, experts often underestimate the difficulty novices face in solving tasks [7] or use language or concepts that are beyond the grasp of novices [6]. Moreover, as feedback
Figure 2: Current systems (in orange) focus on asynchronous, single-bit feedback by requesters. Shepherd (in blue) investigates richer, synchronous feedback by requesters and peers.
Tutorials
• Matt Lease: http://ir.ischool.utexas.edu/crowd/
• AAAI 2011 (with HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law & Luis von Ahn)
• WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar Alonso and Matthew Lease)
– http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
• LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob Carpenter and Massimo Poesio)
– http://lingpipe-blog.com/2010/05/17/
• ECIR 2010: Crowdsourcing for Relevance Evaluation. (Omar Alonso) – http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
• CVPR 2010: Mechanical Turk for Computer Vision. (Alex Sorokin and Fei‐Fei Li) – http://sites.google.com/site/turkforvision/
• CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose) – http://videolectures.net/cikm08_rose_cfre/
• WWW2011: Managing Crowdsourced Human Computation (Panos Ipeirotis) – http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
53
Social Q&A on Twitter
S. Paul, L. Hong, E. Chi, ICWSM 2011
54 3/27/12
Why social Q&A?
55
People turn to their friends on social networks because they trust their friends to provide tailored answers to subjective questions on niche topics.
Research Questions
• What kinds of questions are Twitter users asking their friends? (Types and topics of questions)
• Are users receiving responses to the questions they are asking? (Number, speed, and relevancy of responses)
• How does the nature of the social network affect Q&A behavior? (Size and usage of network, reciprocity of relationship)
58
Identifying question tweets was challenging
– Advertisement framed as question
– Rhetorical question
– Missing context
59
Classifying candidate tweets using Mechanical Turk
• Crowd-sourced question tweet identification to Amazon Mechanical Turk
• Used heuristics to identify candidate tweets that were possibly questions
• Each tweet classified by two Turkers
• Each Turker classified 25 tweets: 20 candidates and 5 control tweets
• Only accepted data from Turkers who classified all control tweets correctly
60
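The control-tweet acceptance rule above can be sketched as a simple filter over each Turker's batch; the data layout and worker/tweet IDs are hypothetical.

```python
def accept_workers(assignments, controls):
    """Keep only workers who classified every control tweet correctly.
    assignments: {worker: {tweet_id: label}}; controls: {tweet_id: label}."""
    ok = set()
    for worker, labels in assignments.items():
        if all(labels.get(t) == lab for t, lab in controls.items()):
            ok.add(worker)
    return ok

controls = {"c1": "question", "c2": "not_question"}
assignments = {
    "W1": {"c1": "question", "c2": "not_question", "t9": "question"},
    "W2": {"c1": "question", "c2": "question",     "t9": "question"},
}
print(accept_workers(assignments, controls))  # → {'W1'}
```

Only labels from accepted workers (here, W1's judgment on t9) would feed into the final question/non-question classification.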
Overall method for filtering questions
– Random sample of public tweets: 1.2 million
– Applied heuristics to identify candidate tweets: 12,000 (4,100 presented to Turkers)
– Classified candidates using Mechanical Turk: 1152
– Tracked responses to each candidate tweet: 624
61
Rhetorical (42%), factual (16%), and poll (15%) questions were common. Significant percentage of personal & health (11%) questions.
Findings: Types and topics of questions !!
!
62 3/27/12
Example questions:
– “How do you feel about interracial dating?”
– “In UK, when you need to see a specialist, do you need special forms or permission?”
– “Which team is better raiders or steelers?”
– “Any good iPad app recommendations?”
– “Any idea how to lost weight fast?”
Question topics: entertainment 32%, others 16%, personal & health 11%, technology 10%, ethics & philosophy 7%, greetings 7%, uncategorized 5%, current events 4%, restaurant/food 4%, professional 4%
Findings: Responses to questions
• Low (18.7%) response rate in general, but quick responses
[Chart: log(number of questions) vs. number of answers]
• The number of responses has a long-tail distribution
• Most often reciprocity between asker and answerer was one-way (55%)
• Responses were largely (84%) relevant
63
Findings: Social network characteristics
Which characteristics of the asker predict whether she will receive a response?
Logistic regression modeling (structural properties):
– Number of followers (+)
– Number of tweets posted
– Number of days on Twitter (+)
– Frequency of use of Twitter
– Ratio of followers/followees (+)
– Reciprocity rate (-)
Network size and status in network are good predictors of whether the asker will receive a response
64
Thanks!
• http://edchi.net • @edchi
• Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008. Florence, Italy.
• Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008. San Diego, CA. [Best Note Award]
66