Crowdsourcing using Mechanical Turk for Human Computer Interaction Research
Ed H. Chi, Research Scientist, Google (work done while at [Xerox] PARC)


TRANSCRIPT

Page 1: Crowdsourcing using MTurk for HCI research

Crowdsourcing using Mechanical Turk for Human Computer Interaction Research

Ed H. Chi

Research Scientist Google (work done while at [Xerox] PARC)

1

Page 2: Crowdsourcing using MTurk for HCI research

Historical Footnote

•  De Prony, 1794, hired hairdressers (unemployed after the French Revolution; they knew only addition and subtraction) to create logarithmic and trigonometric tables.
•  He managed the process by splitting the work into very detailed workflows.
–  Grier, When Computers Were Human, 2005

2

!"#$% &'#(")$)*'%+ ,'"%- .• !"#$%/ 0121 )31 4*2/)56'#(")12/+7 "/1- 4'2#$)3 6'#(")$)*'%/

• !"#$%&'() 6'#(")$)*'%8– &9$*2$")+ $/)2'%'#:+ .;<=8:&'#(")1- )31 !$991:>/6'#1) '2?*) @)3211 ?'-:(2'?91#A )&*&)&%# +,((2'?91#A )&*&)&%# +,(-$./" '4 %"#12*66'#(")$)*'%/ $62'// B$/)2'%'#12/

C2*12+ D31% 6'#(")12/ 0121 3"#$%+ EFF<C2*12+ GHHH I%%$9/ .JJ=

Page 3: Crowdsourcing using MTurk for HCI research

Talk in 3 Acts

•  Act I: –  How we almost failed in using MTurk?! –  [Kittur, Chi, Suh, CHI2008]

•  Act II: –  Apply MTurk to visualization evaluation –  [Kittur, Suh, Chi, CSCW2008]

•  Act III: –  Where are the limits?

3

Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In CHI2008.
Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In CSCW2008.

Page 4: Crowdsourcing using MTurk for HCI research

Example Task from Amazon MTurk

4

Page 5: Crowdsourcing using MTurk for HCI research

Using Mechanical Turk for user studies

                     Traditional user studies       Mechanical Turk
Task complexity      Complex, Long                  Simple, Short
Task subjectivity    Subjective, Opinions           Objective, Verifiable
User information     Targeted demographics,         Unknown demographics,
                     High interactivity             Limited interactivity

Can Mechanical Turk be usefully used for user studies?

5

Page 6: Crowdsourcing using MTurk for HCI research

Task

•  Assess quality of Wikipedia articles
•  Started with ratings from expert Wikipedians
–  14 articles (e.g., “Germany”, “Noam Chomsky”)
–  7-point scale
•  Can we get matching ratings with Mechanical Turk?

6

Page 7: Crowdsourcing using MTurk for HCI research

Experiment 1

•  Rate articles on 7-point scales:
–  Well written
–  Factually accurate
–  Overall quality
•  Free-text input:
–  What improvements does the article need?
•  Paid $0.05 each

7

Page 8: Crowdsourcing using MTurk for HCI research

Experiment 1: Good news

•  58 users made 210 ratings (15 per article)
–  $10.50 total
•  Fast results
–  44% within a day, 100% within two days
–  Many completed within minutes

8

Page 9: Crowdsourcing using MTurk for HCI research

Experiment 1: Bad news

•  Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)

•  Worse, 59% potentially invalid responses

•  Nearly 75% of these done by only 8 users

                     Experiment 1
Invalid comments     49%
<1 min responses     31%

9

Page 10: Crowdsourcing using MTurk for HCI research

Not a good start

•  Summary of Experiment 1:
–  Only marginal correlation with experts
–  Heavy gaming of the system by a minority
•  Possible responses:
–  Make sure these gamers are not rewarded
–  Ban them from doing your HITs in the future
–  Create a reputation system [Delores Lab]
•  Can we change how we collect user input?

10

Page 11: Crowdsourcing using MTurk for HCI research

Design changes

•  Use verifiable questions to signal monitoring
–  “How many sections does the article have?”
–  “How many images does the article have?”
–  “How many references does the article have?”

11

Page 12: Crowdsourcing using MTurk for HCI research

Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as good-faith answers
–  “Provide 4-6 keywords that would give someone a good summary of the contents of the article”

12

Page 13: Crowdsourcing using MTurk for HCI research

Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as good-faith answers
•  Make verifiable answers useful for completing the task
–  Used tasks similar to how Wikipedians evaluate quality (organization, presentation, references)

13

Page 14: Crowdsourcing using MTurk for HCI research

Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as good-faith answers
•  Make verifiable answers useful for completing the task
•  Put verifiable tasks before subjective responses
–  First do objective tasks and summarization
–  Only then evaluate subjective quality
–  Ecological validity?

14

Page 15: Crowdsourcing using MTurk for HCI research

Experiment 2: Results

•  124 users provided 277 ratings (~20 per article)
•  Significant positive correlation with Wikipedians
–  r=.66, p=.01
•  Smaller proportion of malicious responses
•  Increased time on task

                     Experiment 1    Experiment 2
Invalid comments     49%             3%
<1 min responses     31%             7%
Median time          1:30            4:06

15
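A minimal sketch (not the authors' code) of the comparison behind these numbers: correlate per-article mean Turker ratings with the expert Wikipedian ratings on the same 7-point scale. The article names and rating values below are hypothetical.

```python
# Minimal sketch: correlate mean Turker ratings per article with expert ratings.
# The articles and rating values are made up for illustration.
import numpy as np
from scipy.stats import pearsonr

expert_rating = {"Germany": 6, "Noam Chomsky": 5, "Beeswax": 3}                      # hypothetical
turker_ratings = {"Germany": [6, 7, 5], "Noam Chomsky": [4, 5, 6], "Beeswax": [2, 3, 3]}

articles = sorted(expert_rating)
experts = np.array([expert_rating[a] for a in articles], dtype=float)
turkers = np.array([np.mean(turker_ratings[a]) for a in articles])

r, p = pearsonr(experts, turkers)   # Experiment 2 reported r=.66, p=.01 over 14 articles
print(f"r={r:.2f}, p={p:.3f}")
```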

Page 16: Crowdsourcing using MTurk for HCI research

Quick Summary of Tips

1.  Use verifiable questions to signal monitoring
2.  Make malicious answers as high cost as good-faith answers
3.  Make verifiable answers useful for completing the task
4.  Put verifiable tasks before subjective responses

•  Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost

•  Good results require careful task design

16

Page 17: Crowdsourcing using MTurk for HCI research

Generalizing to other MTurk studies

•  Combine objective and subjective questions
–  Rapid prototyping: ask verifiable questions about the content/design of the prototype before subjective evaluation
–  User surveys: ask common-knowledge questions before asking for opinions
•  Filtering for quality (a minimal filtering sketch follows this slide)
–  Put in a field for free-form responses and filter out data without answers
–  Filter out results that came in too quickly
–  Sort by WorkerID and look for cut-and-paste answers
–  Look for outliers in the data that are suspicious

17
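A minimal sketch of the quality filters listed on the slide above, under assumed column names (WorkerId, WorkTimeInSeconds, comment, rating, article); it is an illustration, not the study's pipeline.

```python
# Minimal sketch of the quality filters above, assuming results were exported to a
# CSV with hypothetical columns: WorkerId, WorkTimeInSeconds, comment, rating, article.
import pandas as pd

df = pd.read_csv("mturk_results.csv")

# 1. Drop responses whose free-form field is empty.
df = df[df["comment"].fillna("").str.strip().str.len() > 0]

# 2. Drop responses that came in implausibly fast (e.g., under 60 seconds).
df = df[df["WorkTimeInSeconds"] >= 60]

# 3. Drop workers who mostly paste the same comment across HITs.
uniq_ratio = df.groupby("WorkerId")["comment"].nunique() / df.groupby("WorkerId").size()
df = df[~df["WorkerId"].isin(uniq_ratio[uniq_ratio < 0.5].index)]

# 4. Surface suspicious rating outliers (>3 SD from the article mean) for manual review.
grp = df.groupby("article")["rating"]
z = (df["rating"] - grp.transform("mean")) / grp.transform("std")
print(df[z.abs() > 3])
```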

Page 18: Crowdsourcing using MTurk for HCI research

Talk in 3 Acts

•  Act I: –  How we almost failed?!

•  Act II: –  Applying MTurk to visualization evaluation

•  Act III: –  Where are the limits?

18

Page 19: Crowdsourcing using MTurk for HCI research

What would make you trust Wikipedia more?

20

Page 20: Crowdsourcing using MTurk for HCI research

What is Wikipedia?

“Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you’re getting the best possible information.” – Steve Carell, The Office

21

Page 21: Crowdsourcing using MTurk for HCI research

What would make you trust Wikipedia more?

Nothing

22

Page 22: Crowdsourcing using MTurk for HCI research

What would make you trust Wikipedia more?

“Wikipedia, just by its nature, is impossible to trust completely. I don't think this can necessarily be changed.”

23

Page 23: Crowdsourcing using MTurk for HCI research

WikiDashboard
•  Transparency of social dynamics can reduce conflict and coordination issues
•  Attribution encourages contribution
–  WikiDashboard: Social dashboard for wikis
–  Prototype system: http://wikidashboard.parc.com
•  Visualization for every wiki page showing edit history timeline and top individual editors
•  Can drill down into activity history for specific editors and view edits to see changes side-by-side

24

Citation: Suh et al. CHI 2008 Proceedings


Page 24: Crowdsourcing using MTurk for HCI research

Hillary Clinton

25

Page 25: Crowdsourcing using MTurk for HCI research

Top Editor - Wasted Time R

26

Page 26: Crowdsourcing using MTurk for HCI research

Surfacing information

•  Numerous studies mining Wikipedia revision history to surface trust-relevant information
–  Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004; Zeng et al., 2006

•  But how much impact can this have on user perceptions in a system which is inherently mutable?

Suh, Chi, Kittur, & Pendleton, CHI2008

27

Page 27: Crowdsourcing using MTurk for HCI research

Hypotheses

1.  Visualization will impact perceptions of trust
2.  Compared to baseline, visualization will impact trust both positively and negatively
3.  Visualization should have the most impact when there is high uncertainty about an article
•  Low quality
•  High controversy

28

Page 28: Crowdsourcing using MTurk for HCI research

Design

•  3 x 2 x 2 design

                  Controversial                     Uncontroversial
High quality      Abortion; George Bush             Volcano; Shark
Low quality       Pro-life feminism;                Disk defragmenter; Beeswax
                  Scientology and celebrities

Visualization conditions:
•  High stability
•  Low stability
•  Baseline (none)

29

Page 29: Crowdsourcing using MTurk for HCI research

Example: High trust visualization

30

Page 30: Crowdsourcing using MTurk for HCI research

Example: Low trust visualization

31

Page 31: Crowdsourcing using MTurk for HCI research

Summary info

•  % from anonymous users

32

Page 32: Crowdsourcing using MTurk for HCI research

Summary info

•  % from anonymous users

•  Last change by anonymous or established user

33

Page 33: Crowdsourcing using MTurk for HCI research

Summary info

•  % from anonymous users

•  Last change by anonymous or established user

•  Stability of words

34

Page 34: Crowdsourcing using MTurk for HCI research

Graph

•  Instability

35

Page 35: Crowdsourcing using MTurk for HCI research

Method

•  Users recruited via Amazon’s Mechanical Turk
–  253 participants
–  673 ratings
–  7 cents per rating
–  Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
•  To ensure salience and valid answers, participants answered:
–  In what time period was this article the least stable?
–  How stable has this article been for the last month?
–  Who was the last editor?
–  How trustworthy do you consider the above editor?

36

Page 36: Crowdsourcing using MTurk for HCI research

Results

[Chart: Trustworthiness rating (1-7) for low- and high-quality articles, uncontroversial and controversial, under High stability, Baseline, and Low stability visualizations]

Main effects of quality and controversy:
•  high-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
•  uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)

37

Page 37: Crowdsourcing using MTurk for HCI research

Results

[Chart: Trustworthiness rating (1-7) for low- and high-quality articles, uncontroversial and controversial, under High stability, Baseline, and Low stability visualizations]

Interaction effect of quality and controversy:
•  high-quality articles were rated equally trustworthy whether controversial or not, while
•  low-quality articles were rated lower when they were controversial than when they were uncontroversial.

38

Page 38: Crowdsourcing using MTurk for HCI research

Results

1.  Significant effect of visualization: High-Stability > Low-Stability, p < .001
2.  Viz has both positive and negative effects:
–  High-Stability > Baseline (p < .001) > Low-Stability (p < .01)
3.  No interaction of visualization with either quality or controversy
–  Robust across visualization conditions

[Chart: Trustworthiness rating (1-7) for low- and high-quality articles, uncontroversial and controversial, under High stability, Baseline, and Low stability visualizations]

39


Page 41: Crowdsourcing using MTurk for HCI research

Talk in 3 Acts

•  Act I: –  How we almost failed?!

•  Act II: –  Applying MTurk to visualization evaluation

•  Act III: –  Where are the limits?

42

Page 42: Crowdsourcing using MTurk for HCI research

Limitations of Mechanical Turk

•  No control of users’ environment
–  Potential for different browsers, physical distractions
–  General problem with online experimentation
•  Not yet designed for user studies
–  Difficult to do between-subjects design
–  May need some programming
•  Hard to control user population
–  Hard to control demographics, expertise

43

Page 43: Crowdsourcing using MTurk for HCI research

Crowdsourcing for HCI Research

•  Does my interface/visualization work?
–  WikiDashboard: transparency vis for Wikipedia [Suh et al.]
–  Replicating perceptual experiments [Heer et al., CHI2010]
•  Coding of large amounts of user data
–  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]
•  Incentive mechanisms
–  Intrinsic vs. extrinsic rewards: Games vs. Pay
–  [Horton & Chilton, 2010 for MTurk] and [Ariely, 2009] in general

44

Page 44: Crowdsourcing using MTurk for HCI research

Crowdsourcing for HCI Research

•  Does my interface/visualization work?
–  WikiDashboard: transparency vis for Wikipedia [Suh et al. VAST, Kittur et al. CSCW2008]
–  Replicating perceptual experiments [Heer et al., CHI2010]
•  Coding of large amounts of user data
–  What is a Question in Twitter? [S. Paul, L. Hong, E. Chi, ICWSM 2011]
•  Incentive mechanisms
–  Intrinsic vs. extrinsic rewards: Games vs. Pay
–  [Horton & Chilton, 2010 on MTurk] and Satisficing
–  [Ariely, 2009] in general: Higher pay != Better work

45

Page 45: Crowdsourcing using MTurk for HCI research

Managing Quality

•  Quality through redundancy: combining votes (a minimal voting sketch follows this slide)
–  Majority vote [works best when workers have similar quality]
–  Worker-quality-adjusted vote
–  Managing dependencies
•  Quality through gold data
–  Advantageous with imbalanced datasets and bad workers
•  Estimating worker quality (redundancy + gold)
–  Calculate the confusion matrix and see if you actually get some information from the worker
•  Toolkit: http://code.google.com/p/get-another-label/

46 Source: Ipeirotis, WWW2011
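A minimal sketch of the two ideas above (redundancy via majority vote, and worker quality estimated against gold items); this is an illustration with toy data, not the get-another-label toolkit.

```python
# Minimal sketch: majority vote over redundant labels, plus per-worker accuracy
# (a tiny confusion-matrix summary) measured on gold items. Illustrative data only.
from collections import Counter, defaultdict

labels = [("t1", "w1", "spam"), ("t1", "w2", "spam"), ("t1", "w3", "ham"),
          ("t2", "w1", "ham"),  ("t2", "w3", "ham")]          # (item, worker, label)
gold = {"t1": "spam"}                                         # known answers for some items

# Quality through redundancy: majority vote per item
# (works best when workers have roughly similar quality).
votes = defaultdict(list)
for item, worker, label in labels:
    votes[item].append(label)
majority = {item: Counter(ls).most_common(1)[0][0] for item, ls in votes.items()}

# Quality through gold data: per-worker confusion counts of (true label, given label).
confusion = defaultdict(Counter)
for item, worker, label in labels:
    if item in gold:
        confusion[worker][(gold[item], label)] += 1

# A worker whose labels carry no information about the true label can be discarded.
for worker, counts in confusion.items():
    correct = sum(n for (true, given), n in counts.items() if true == given)
    print(worker, "accuracy on gold:", correct / sum(counts.values()))

print("majority votes:", majority)
```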

Page 46: Crowdsourcing using MTurk for HCI research

Coding and Machine Learning

•  Integration with Machine Learning
–  Build automatic classification models using crowdsourced data

47

!"#$%& '(%)*"(+

• ,)#-+' %-.&% */-"+"+0 1-*-,)#-+' %-.&% */-"+"+0 1-*-• 2'& */-"+"+0 1-*- *( .)"%1 #(1&%

Data from existing

crowdsourced answerscrowdsourced answers

N CNew Case Automatic Model

(through machine learning)

Automatic

Answer

Source: Ipeirotis, WWW2011
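A minimal sketch of the integration described above: use aggregated crowd labels as training data for an automatic classifier, then answer new cases automatically. The file and column names are hypothetical; this is not from the talk or the Ipeirotis tutorial.

```python
# Minimal sketch: train an automatic text classifier from crowdsourced labels.
# "crowd_labels.csv" and its columns (text, majority_label) are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("crowd_labels.csv")     # one row per item, label aggregated from workers

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(df["text"], df["majority_label"])

# New cases get an automatic answer; low-confidence ones can be routed back to the crowd.
new_cases = ["Any good iPad app recommendations?"]
confidence = model.predict_proba(new_cases).max(axis=1)
print(model.predict(new_cases), confidence)
```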

Page 47: Crowdsourcing using MTurk for HCI research

Crowd Programming for Complex Tasks

•  Decompose tasks into smaller tasks
–  Digital Taylorism
–  Frederick Winslow Taylor (1856-1915)
–  1911 'Principles of Scientific Management'
•  Crowd Programming Explorations
–  MapReduce models (a minimal partition/map/reduce sketch follows this slide)
•  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.
•  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP
–  Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In KDD 2010 Workshop on Human Computation.

48
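A minimal sketch of the partition/map/reduce decomposition explored in these systems, applied to the article-writing example in the paper excerpts on the next slide. ask_crowd() is a hypothetical stub standing in for posting a HIT and collecting one answer.

```python
# Minimal sketch of a partition/map/reduce crowd workflow for article writing.
# ask_crowd() is a hypothetical stub, not a real MTurk API call.
def ask_crowd(prompt: str) -> str:
    """Hypothetical: post `prompt` as a HIT and return one worker's answer."""
    raise NotImplementedError

def write_article(topic: str) -> str:
    # Partition: one worker proposes an outline (a list of section headings).
    outline = ask_crowd(f"List 4-6 section headings, separated by ';', "
                        f"for an article on {topic}").split(";")

    # Map: several workers each collect one fact per section, in parallel.
    facts = {h: [ask_crowd(f"Give one fact about '{h.strip()}' for an article on {topic}")
                 for _ in range(3)]
             for h in outline}

    # Reduce: one worker turns each section's facts into a paragraph,
    # and a final worker merges the paragraphs into a coherent article.
    paragraphs = [ask_crowd(f"Write a paragraph on '{h.strip()}' using these facts: {facts[h]}")
                  for h in outline]
    return ask_crowd("Merge these paragraphs into one coherent article:\n\n"
                     + "\n\n".join(paragraphs))
```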

Page 48: Crowdsourcing using MTurk for HCI research

Crowd Programming for Complex Tasks

•  Crowd Programming Explorations
–  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.
–  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP

49

[Paper excerpt shown on the slide:]

“Please solve the 16-question SAT located at http://bit.ly/SATexam”. In both cases, we paid workers between $0.10 and $0.40 per HIT. Each “subdivide” or “merge” HIT received answers within 4 hours; solutions to the initial task were complete within 72 hours.

Results

The decompositions produced by Turkers while running Turkomatic are displayed in Figure 1 (essay-writing) and Figure 4 (SAT).

In the essay task, each “subdivide” HIT was posted three times by Turkomatic and the best of the three was selected by experimenters (simulating Turker voting) to continue the solution process. The proposed decompositions were overwhelmingly linear and chose to break the task down either by paragraph or by activity (for example, one Turker proposed: brainstorm, create outline, write topic sentences, fill in facts). The decomposition used in the final essay used two levels of recursion. As groups of subtasks were completed, Turkomatic passed solutions to merge workers for reassembly. The resulting essay is complete and coherent, although somewhat lacking in cohesion.

We allowed essay-writers to pick a topic; the chosen one (university legacy admissions) was somewhat specialized, but the final essay displayed a reasonably good understanding of the topic, even if the writing quality was often mixed. The decomposition selected for the SAT task used only one level of recursion. Workers divided the task into 12 subtasks consisting of 1 to 3 thematically linked questions. These were each solved in parallel by distinct workers and the results were given to a merge worker who produced the final solution. The score on the overall solution was 12/17, with the worst performance on math and grammar questions and the best in reading and vocabulary.

Obtaining useful decompositions proved tricky for workers – many seemed confused about the nature of the planning task. However, once the tasks were decomposed, solution of the constituent parts and reassembly into an overall solution were straightforward for Turkers to accomplish.

Evaluation: Interface

In a second informal study, we examined whether reducing user involvement in the HIT design improved ease of use and efficiency. We hypothesized that the high level of abstraction enabled by automatic task design would make it easier for requesters to crowdsource their work.

We asked a pool of four users to try to collect answers for a basic brainstorming task on Mechanical Turk. The task asked our participants to generate five ideas of topics for an essay. Participants performed this task twice, first, using Turkomatic to post tasks and obtain results, then, using Mechanical Turk’s web interface. No instruction on either interface was provided. We examined how long it took the user to post the task.

With Turkomatic, our users finished posting their tasks in an average of 37 seconds. On Mechanical Turk, where low-level task design was required, users needed an average of 244.2 seconds to post their tasks. More importantly, the HITs posted by two users who were not familiar with Mechanical Turk would not have produced any meaningful results. One user posted minor variations of the default templates provided on

Figure 4. For the SAT task, we uploaded sixteen questions from a high school Scholastic Aptitude Test to the web and posed the following task to Turkomatic: “Please solve the 16-question SAT located at http://bit.ly/SATexam”.

[Second paper excerpt shown on the slide:]

In map tasks, a specified processing step is applied to each item in the partition. These tasks are ideally simple enough to be answerable by a single worker in a short amount of time. For example, a map task for article writing could ask a worker to collect one fact on a given topic in the article’s outline. Multiple instances of a map task could be instantiated for each partition; e.g., multiple workers could be asked to collect one fact each on a topic in parallel.

Finally, reduce tasks take all the results from a given map task and consolidate them, typically into a single result. In the article writing example, a reduce step might take facts collected for a given topic by many workers and have a worker turn them into a paragraph.

Any of these steps can be iterative. For example, the topic for an article section defined in a first partition can itself be partitioned into subsections. Similarly, the paragraphs returned from one reduction step can in turn be reordered through a second reduction step.

Case studies

We explored as a case study the complex task of writing an encyclopedia article. Writing an article is a challenging and interdependent task that involves many different subtasks: planning the scope of the article, how it should be structured, finding and filtering information to include, writing up that information, finding and fixing grammar and spelling, and making the article coherent. These characteristics make article writing a challenging but representative test case for our approach.

To solve this problem we created a simple flow consisting of a partition, map, and reduce step. The partition step asked workers to create an article outline, represented as an array of section headings such as “History” and “Geography”. In an environment where workers would complete high effort tasks, the next step might be to have someone write a paragraph for each section. However, the difficulty and time involved in finding the information for and writing a complete paragraph for a heading is a mismatch to the low work capacity of micro-task markets. Thus we broke the task up further, separating the information collection and writing subtasks. Specifically, each section heading from the partition was used to generate map tasks in

Figure 3. Partial results of a collaborative writing task.


Page 49: Crowdsourcing using MTurk for HCI research

Future Directions in Crowdsourcing

•  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010

50

Figure 2: Six questions asked by participants, the photographs they took, and answers received with latency in seconds.
•  “What color is this pillow?” (89s); (105s) multiple shades of soft green, blue and gold
•  “What denomination is this bill?” (24s) 20; (29s) 20
•  “Do you see picnic tables across the parking lot?” (13s) no; (46s) no
•  “What temperature is my oven set to?” (69s) it looks like 425 degrees but the image is difficult to see; (84s) 400; (122s) 450
•  “Can you please tell me what this can is?” (183s) chickpeas; (514s) beans; (552s) Goya Beans
•  “What kind of drink does this can hold?” (91s) Energy; (99s) no can in the picture; (247s) energy drink

the total time required to answer a question. quikTurkit also makes it easy to keep a pool of workers of a given size continuously engaged and waiting, although workers must be paid to wait. In practice, we have found that keeping 10 or more workers in the pool is doable, although costly.

Most Mechanical Turk workers find HITs to do using the provided search engine (available at mturk.com). This search engine allows users to view available HITs sorted by creation date, the number of HITs available, the reward amount, the expiration date, the title, or the time allotted for the work. quikTurkit employs several heuristics for optimizing its listing in order to obtain workers quickly. First, it posts many more HITs than are actually required at any time because only a fraction will actually be picked up within the first few minutes. These HITs are posted in batches, helping quikTurkit HITs stay near the top. Finally, quikTurkit supports posting multiple HIT variants at once with different titles or reward amounts to cover more of the first page of search results.

VizWiz currently posts a maximum of 64 times more HITs than are required, posts them at a maximum rate of 4 HITs every 10 seconds, and uses 6 different HIT variants (2 titles × 3 rewards). These choices are explored more closely in the context of VizWiz in the following section.

FIELD DEPLOYMENT

To better understand how VizWiz might be used by blind people in their everyday lives, we deployed it to 11 blind iPhone users aged 22 to 55 (3 female). Participants were recruited remotely and guided through using VizWiz over the phone until they felt comfortable using it. The wizard interface used by VizWiz speaks instructions as it goes, and so participants generally felt comfortable using VizWiz after a single use. Participants were asked to use VizWiz at least once a day for one week. After each answer was returned, participants were prompted to leave a spoken comment.

quikTurkit used the following two titles for the jobs that it posted to Mechanical Turk: “3 Quick Visual Questions” and “Answer Three Questions for A Blind Person.” The reward distribution was set such that half of the HITs posted paid $0.01, and a quarter paid $0.02 and $0.03 each.

Asking Questions: Participants asked a total of 82 questions (see Figure 2 for participant examples and accompanying photographs). Speech recognition correctly recognized the question asked for only 13 of the 82 questions (15.8%), and 55 (67.1%) questions could be answered from the photos taken. Of the 82 questions, 22 concerned color identification, 14 were open ended “what is this?” or “describe this picture” questions, 13 were of the form “what kind of (blank) is this?,” 12 asked for text to be read, 12 asked whether a particular object was contained within the photograph, 5 asked for a numerical answer or currency denomination, and 4 did not fit into these categories.

Problems Taking Pictures: 9 (11.0%) of the images taken were too dark for the question to be answered, and 17 (21.0%) were too blurry for the question to be answered. Although a few other questions could not be answered due to the photos that were taken, photos that were too dark or too blurry were the most prevalent reason why questions could not be answered. In the next section, we discuss a second iteration on the VizWiz prototype that helps to alert users to these particular problems before sending the questions to workers.

Answers: Overall, the first answer received was correct in 71 of 82 cases (86.6%), where “correct” was defined as either being the answer to the question or an accurate description of why the worker could not answer the question with the information contained within the photo provided (i.e., “This image is too blurry”). A correct answer was received in all cases by the third answer.

The first answer was received across all questions in an average of 133.3 seconds (SD=132.7), although the latency required varied dramatically based on whether the question could actually be answered from the picture and on whether the speech recognition accurately recognized the question (Figure 4). Workers took 105.5 seconds (SD=160.3) on average to answer questions that could be answered by the provided photo compared to 170.2 seconds (SD=159.5) for those
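A minimal sketch of the quikTurkit-style posting heuristics described in the excerpt above (over-posting in batches and rotating title/reward variants). post_hit() is a hypothetical stub, not the real MTurk or quikTurkit API.

```python
# Minimal sketch of a quikTurkit-style posting loop: over-post HITs in small
# batches and rotate title/reward variants so the listing stays visible.
# post_hit() is a hypothetical stub, not a real API call.
import itertools
import time

def post_hit(title: str, reward: float, count: int) -> None:
    """Hypothetical: create `count` assignments of a HIT with this title and reward."""
    print(f"posting {count} HITs: {title!r} at ${reward:.2f}")

def keep_pool_warm(needed: int, overpost_factor: int = 4,
                   batch_size: int = 4, interval_s: int = 10) -> None:
    titles = ["3 Quick Visual Questions", "Answer Three Questions for A Blind Person"]
    rewards = [0.01, 0.02, 0.03]
    variants = itertools.cycle(itertools.product(titles, rewards))   # 2 titles x 3 rewards
    remaining = needed * overpost_factor
    while remaining > 0:
        title, reward = next(variants)
        post_hit(title, reward, count=min(batch_size, remaining))
        remaining -= batch_size
        time.sleep(interval_s)          # throttle the posting rate

keep_pool_warm(needed=10)
```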

Page 50: Crowdsourcing using MTurk for HCI research

Future Directions in Crowdsourcing

•  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010

•  Embedding of Crowdwork inside Tools –  Bernstein, et al. Soylent, UIST 2010

51

Page 51: Crowdsourcing using MTurk for HCI research

Future Directions in Crowdsourcing

•  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010

•  Embedding of Crowdwork inside Tools –  Bernstein, et al. Soylent, UIST 2010

•  Shepherding Crowdwork –  Dow et al. CHI2011 WIP

52

workers to persevere and accept additional tasks. We investigate these hypotheses through a prototype system, Shepherd, that demonstrates how to make feedback an integral part of crowdsourced creative work.

Understanding Opportunities for Crowd Feedback To effectively design feedback mechanisms that achieve the goals of learning, engagement, and quality improvement, we first analyze the important dimensions of the design space for crowd feedback (Figure 2).

Timeliness: When should feedback be shown? In micro-task work, workers stay with tasks for a short while, then move on. This implies two timing options: synchronously deliver feedback when workers are still engaged in a set of tasks, or asynchronously deliver feedback after workers have completed the tasks.

Synchronous feedback may have more impact on future task performance since it arrives while workers are still thinking about the task domain. It also increases the probability that workers will continue onto similar tasks. However, synchronous feedback places a burden on the feedback providers; they have little time to review work. This implies a need for tools or scheduling algorithms that enable near real-time feedback. Asynchronous feedback gives feedback providers more time to review and comment on work.

However, workers may have forgotten about the task or feel unmotivated to review the feedback and to return to the task.

Currently, platforms like Mechanical Turk only allow asynchronous feedback with no enticement to return. Requesters can provide feedback at payment time, but at that point (typically days later), workers care more about getting paid than improving submitted work. More importantly, unless requesters have more jobs available, workers cannot act on requesters’ advice.

Specificity: How detailed should feedback be? Mechanical Turk currently allows requesters one bit of feedback—accept or reject. While additional freeform communication is possible, it is rarely used unless workers file complaints. Workers may learn more if they receive detailed and personalized feedback on each piece of work. However, this added specificity comes at a price: feedback providers must spend time authoring feedback. When feedback resources are limited, customizable templates can accelerate feedback generation and enable requesters to codify domain knowledge into pre-authored statements. However, templates could be perceived as overly general or repetitive, reducing their desired impact. Workers may need explicit incentive to read and reflect on feedback.

Source: Who should provide feedback? Crowdsourcing requesters post tasks with specific quality objectives in mind; they are a natural choice for assuming the feedback role. However, experts often underestimate the difficulty novices face in solving tasks [7] or use language or concepts that are beyond the grasp of novices [6]. Moreover, as feedback

Figure 2: Current systems (in orange) focus on asynchronous, single-bit feedback by requesters. Shepherd (in blue) investigates richer, synchronous feedback by requesters and peers.

Page 52: Crowdsourcing using MTurk for HCI research

Tutorials

•  Matt Lease: http://ir.ischool.utexas.edu/crowd/
•  AAAI 2011 (with HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law & Luis von Ahn)
•  WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar Alonso and Matthew Lease)
–  http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
•  LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob Carpenter and Massimo Poesio)
–  http://lingpipe-blog.com/2010/05/17/
•  ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso)
–  http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
•  CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and Fei-Fei Li)
–  http://sites.google.com/site/turkforvision/
•  CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)
–  http://videolectures.net/cikm08_rose_cfre/
•  WWW 2011: Managing Crowdsourced Human Computation (Panos Ipeirotis)
–  http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation

53

Page 53: Crowdsourcing using MTurk for HCI research

Social Q&A on Twitter

S. Paul, L. Hong, E. Chi, ICWSM 2011

54

Page 54: Crowdsourcing using MTurk for HCI research

Why social Q&A?

People turn to their friends on social networks because they trust their friends to provide tailored answers to subjective questions on niche topics.

55

Page 55: Crowdsourcing using MTurk for HCI research

Research Questions
•  What kinds of questions are Twitter users asking their friends? (Types and topics of questions)
•  Are users receiving responses to the questions they are asking? (Number, speed, and relevancy of responses)
•  How does the nature of the social network affect Q&A behavior? (Size and usage of network, reciprocity of relationship)

58

Page 56: Crowdsourcing using MTurk for HCI research

Identifying question tweets was challenging
•  Advertisement framed as a question
•  Rhetorical question
•  Missing context

Used heuristics to identify candidate tweets that were possibly questions (a minimal heuristic sketch follows this slide)

59
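A minimal sketch of flagging candidate question tweets before sending them to Turkers; these are illustrative rules, not the heuristics used in the ICWSM 2011 paper.

```python
# Minimal sketch of simple heuristics for flagging candidate question tweets.
# Illustrative rules only, not the paper's actual heuristics.
import re

QUESTION_STARTERS = ("who", "what", "when", "where", "why", "how",
                     "do", "does", "is", "are", "can", "any", "which")

def is_candidate_question(tweet: str) -> bool:
    text = tweet.strip().lower()
    if not text:
        return False
    if re.search(r"https?://", text):        # tweets with links are often ads, not questions
        return False
    if "?" in text:
        return True
    return text.split()[0] in QUESTION_STARTERS   # e.g., "any good iPad app recommendations"

tweets = ["Any good iPad app recommendations?",
          "Which team is better raiders or steelers",
          "Check out our amazing new deal http://example.com ?"]
print([t for t in tweets if is_candidate_question(t)])
```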

Page 57: Crowdsourcing using MTurk for HCI research

Classifying candidate tweets using Mechanical Turk
Crowd-sourced question tweet identification to Amazon Mechanical Turk
•  Each tweet classified by two Turkers
•  Each Turker classified 25 tweets: 20 candidates and 5 control tweets
•  Only accepted data from Turkers who classified all control tweets correctly

60

Page 58: Crowdsourcing using MTurk for HCI research

Overall method for filtering questions
[Flowchart: Random sample of public tweets (1.2 million) → heuristics applied to identify candidate tweets (12,000; 4,100 presented to Turkers) → candidates classified using Mechanical Turk → responses tracked to each candidate tweet; resulting counts: 1,152 and 624]

61

Page 59: Crowdsourcing using MTurk for HCI research

Findings: Types and topics of questions
•  Rhetorical (42%), factual (16%), and poll (15%) questions were common
•  Significant percentage of personal & health (11%) questions

Example questions: “How do you feel about interracial dating?” “In UK, when you need to see a specialist, do you need special forms or permission?” “Which team is better raiders or steelers?” “Any good iPad app recommendations?” “Any idea how to lost weight fast?”

[Pie chart of question topics: entertainment 32%, others 16%, personal & health 11%, technology 10%, ethics & philosophy 7%, greetings 7%, uncategorized 5%, current events 4%, restaurant/food 4%, professional 4%]

62

Page 60: Crowdsourcing using MTurk for HCI research

Findings: Responses to questions
•  Low (18.7%) response rate in general, but quick responses
•  The number of responses has a long-tail distribution
•  Most often reciprocity between asker and answerer was one-way (55%)
•  Responses were largely (84%) relevant

[Histogram: log(number of questions) by number of answers, 0 to 147]

63

Page 61: Crowdsourcing using MTurk for HCI research

Findings: Social network characteristics
Which characteristics of the asker predict whether she will receive a response?

Logistic regression modeling (structural properties):
•  Number of followers (+)
•  Number of days on Twitter (+)
•  Ratio of followers/followees (+)
•  Reciprocity rate (-)
•  Number of tweets posted
•  Frequency of use of Twitter

Network size and status in the network are good predictors of whether the asker will receive a response (a minimal regression sketch follows this slide).

64
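A minimal sketch of the logistic regression described above, with hypothetical feature and column names; it illustrates the modeling setup, not the authors' code.

```python
# Minimal sketch: logistic regression predicting whether a question tweet gets a
# response from structural properties of the asker. Column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("askers.csv")       # one row per question tweet
features = ["num_followers", "num_tweets", "days_on_twitter",
            "tweets_per_day", "follower_followee_ratio", "reciprocity_rate"]

model = LogisticRegression(max_iter=1000)
model.fit(df[features], df["got_response"])       # 1 if the question received a reply

# Coefficient signs correspond to the (+)/(-) annotations on the slide.
print(dict(zip(features, model.coef_[0])))
```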

Page 62: Crowdsourcing using MTurk for HCI research

Thanks!

•  [email protected]

•  http://edchi.net •  @edchi

•  Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008. Florence, Italy.

•  Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008. San Diego, CA. [Best Note Award]

66