crowdsourcing massimo poesio part 2: games with a purpose

CROWDSOURCING

Massimo Poesio

Part 2: Games with a Purpose

GAMES WITH A PURPOSE

• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK

• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions

• The key property of games is that PEOPLE WANT TO PLAY THEM

EXAMPLES OF GWAP

• Games at www.gwap.com
– ESP
– Verbosity
– TagATune

• Other games
– Peekaboom
– Phetch

• The first GWAP developed by von Ahn and his group (2003 / 2004)

• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision

• The goal: label the majority of the images on the Web

ESP: the game

ESP: THE GAME

• Two partners are picked at random from the large number of players online

• They are not told who their partner is, and can’t communicate with them

• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game

• If any of the strings typed by one player matches the string typed by the other player, they score points
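
The matching rule can be sketched as below. This is a minimal illustration, not the game's actual code: the function name `first_match` and the normalisation step (lower-casing, trimming whitespace) are assumptions, and the real game checks for matches as the players type rather than on finished lists.

```python
def first_match(guesses_a, guesses_b):
    """Return the first string typed by player A that player B also typed, or None.

    Normalisation (lower-case, stripped whitespace) is an assumption made
    for this sketch; the slides only say that the strings must match.
    """
    seen_b = {g.strip().lower() for g in guesses_b}
    for guess in guesses_a:
        normalised = guess.strip().lower()
        if normalised in seen_b:
            return normalised
    return None


# Hypothetical guesses for one image.
print(first_match(["dog", "grass", "Ball"], ["puppy", "ball", "park"]))  # ball
print(first_match(["car"], ["tree"]))  # None
```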

THE TASK

SCORING BY MATCHING

THE CHALLENGE: SCORES

• One of the motivating factors is to try to score as many points as possible

• Hourly, daily, weekly, and monthly scores are shown

SCORES

THE CHALLENGE: TIMING

• Partners try to agree on as many images as they can during 2 ½ minutes

• The thermometer on the side indicates how many images they have agreed on

• If they agree on 15 images they score bonus points

TABOO WORDS

• To ensure the production of a large number of specific labels, some words are declared TABOO and not allowed

• Taboo words are obtained from the game itself: any word that has been agreed upon by players who were shown a picture earlier becomes a taboo word for that image
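
A minimal sketch of the taboo mechanism, assuming a simple in-memory store keyed by image. The names `taboo`, `allowed_guesses` and `record_agreement` are hypothetical and only illustrate the rule stated above.

```python
from collections import defaultdict

# image id -> set of words already agreed upon for that image (hypothetical store)
taboo = defaultdict(set)


def allowed_guesses(image_id, guesses):
    """Drop guesses that are taboo for this image."""
    return [g for g in guesses if g not in taboo[image_id]]


def record_agreement(image_id, word):
    """An agreed word becomes taboo for later players of the same image."""
    taboo[image_id].add(word)


record_agreement("img-1", "dog")
print(allowed_guesses("img-1", ["dog", "puppy", "leash"]))  # ['puppy', 'leash']
```

Because each agreement extends the taboo list, later pairs are pushed towards labels that earlier pairs did not produce.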

TABOO WORDS

PASSING

GOOD LABELS, COMPLETING AN IMAGE

• A label is considered “good” when more than N players produce it (with N a parameter of the game)

• An image is “done” when its list of taboo words is so extensive that most players pass on it
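
The two criteria can be written down as follows. This is a sketch under stated assumptions: the function names are hypothetical, and reading "most players pass" as "more than half of the plays end in a pass" is an interpretation, not something the slides specify.

```python
from collections import Counter


def good_labels(labels_by_player, n):
    """Labels produced by more than n distinct players.

    labels_by_player maps a player id to the labels that player produced
    for one image; n is the game parameter N.
    """
    counts = Counter()
    for labels in labels_by_player.values():
        counts.update(set(labels))  # count each player at most once per label
    return {label for label, count in counts.items() if count > n}


def is_done(passes, plays, pass_fraction=0.5):
    """Treat an image as done when most plays end in a pass.

    The 0.5 threshold is an assumption of this sketch.
    """
    return plays > 0 and passes / plays > pass_fraction


labels = {"p1": ["dog", "grass"], "p2": ["dog"], "p3": ["dog", "ball"]}
print(good_labels(labels, n=2))  # {'dog'}
print(is_done(passes=7, plays=10))  # True
```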

IMPLEMENTATION

• Pre-recorded game play
– Especially at the beginning, and at quiet times, there won’t always be players to pair with
– In these cases a player is paired against a recorded ‘hand’ of a previous game with the same picture
• Cheating
– Players could cheat in a number of ways, including agreeing on labels / playing against themselves
– A number of mechanisms are in place against those cases
• Selecting images
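
One way to picture pre-recorded game play: the recorded 'hand' is replayed against the live player as if the earlier partner were typing now. The sketch below assumes that each recorded guess is stored with its time offset from the start of the round; that storage format and the name `replay_guesses` are assumptions, not details given in the slides.

```python
def replay_guesses(recorded_hand, elapsed_seconds):
    """Guesses the recorded partner has 'typed' by the given time.

    recorded_hand is a list of (seconds_from_start, guess) pairs saved
    from an earlier game on the same picture.
    """
    return [guess for t, guess in recorded_hand if t <= elapsed_seconds]


# Hypothetical recording of an earlier player on the same image.
hand = [(2.0, "dog"), (5.5, "grass"), (9.0, "ball")]
print(replay_guesses(hand, 6.0))  # ['dog', 'grass']
```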

SOME STATISTICS

• In the 4 months between August 9th 2003 and December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once

• By 2008:
– 200,000 players
– 50 million labels

ANALYSIS

• The numbers indicate that the game is fun to play

• Exciting factors:
– Playing with a partner
– Playing against time

QUALITY OF THE LABELS

• For IMAGE SEARCH:
– choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

VERBOSITY

• … or, the game approach to collecting commonsense knowledge

• Motivation: slow progress both on CYC (5 million facts collected) and on Open Mind Commonsense (around 700,000 facts)

THE GAME

• Based on an existing game, TABOO:
– Players have to guess a word
– One of the players gives hints concerning the word

• In Verbosity, you have two players, the DESCRIBER and the GUESSER, and a SECRET WORD

THE GAME

TEMPLATES IN VERBOSITY

• As in Open Mind Commonsense, templates are used to ensure that the relations / properties of interest are collected

• The Describer produces hints by filling in a template

GUESSING ATTRIBUTES

PRODUCING A DESCRIPTION

TEMPLATES

• _ is a kind of _
• _ is used for _
• _ is typically near/in/on _
• _ is the opposite of _ / _ is related to _
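
As an illustration of how a filled-in template becomes a commonsense fact, here is a small sketch. The relation keys (`is_a`, `used_for`, ...) and the triple format are assumptions made for the example; the slides only list the template wordings.

```python
# Template wordings from the slide; the dictionary keys are hypothetical names.
TEMPLATES = {
    "is_a": "{word} is a kind of {hint}",
    "used_for": "{word} is used for {hint}",
    "near": "{word} is typically near/in/on {hint}",
    "opposite": "{word} is the opposite of {hint}",
    "related": "{word} is related to {hint}",
}


def make_fact(relation, secret_word, hint):
    """Turn a filled-in template into a (sentence, triple) pair."""
    sentence = TEMPLATES[relation].format(word=secret_word, hint=hint)
    return sentence, (secret_word, relation, hint)


print(make_fact("used_for", "laptop", "typing"))
# ('laptop is used for typing', ('laptop', 'used_for', 'typing'))
```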

EMULATION

• As in ESP game, pre-recorded games are used when a player cannot be paired with another player

• The asymmetry of the game causes a problem not encountered in ESP game
– Describer: can just repeat behavior of previous describer
– Guesser: not so easy

RESULTS

• Only published results I’m aware of predate the actual release of the game so I don’t know about the QUANTITY

• Quality:
– Ask six raters whether 200 facts collected using Verbosity are ‘true’
– Around 85% success

PEEKABOOM

• Objective: collect data about the presence of objects in images in order to train vision algorithms for object detection

THE GAME

• Two players
• They take turns at playing ‘Peek’ and ‘Boom’
• ‘Boom’ gets a picture with an associated word; ‘Peek’ has to guess what the associated word is

• ‘Boom’ reveals parts of a picture to ‘Peek’ by clicking on it (each click reveals a circular area of 20 pixels of radius)
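
The reveal rule can be sketched as a simple distance test. The 20-pixel radius comes from the slide; the function name, the pixel coordinates and the click positions are assumptions made for illustration.

```python
REVEAL_RADIUS = 20  # pixels, as stated on the slide


def is_revealed(pixel, clicks, radius=REVEAL_RADIUS):
    """True if the pixel lies within `radius` of any of Boom's clicks."""
    px, py = pixel
    return any(
        (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2 for cx, cy in clicks
    )


clicks = [(100, 100), (160, 120)]  # hypothetical click positions
print(is_revealed((110, 110), clicks))  # True
print(is_revealed((130, 100), clicks))  # False
```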

THE GAME: PEEK

THE GAME

IMPLEMENTATION

• Images and their labels come from ESP
• Cheating:
– Player queue (wait until next ‘matching interval’ – one every 10 seconds – to start playing)
– IP address checks (to make sure players are not paired with themselves)
– Blocking bots: ‘seed images’ (previously annotated) and blacklist
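
A sketch of how the player queue and the IP address check might fit together: players who arrived during one matching interval are paired, and two players with the same IP address are never matched. The function name and data layout are hypothetical. A real system would presumably also shuffle the queue so that pairings are random; that step is left out here to keep the example deterministic.

```python
def pair_players(queue):
    """Pair waiting players, never matching two players with the same IP.

    queue is a list of (player_id, ip_address) tuples collected during one
    matching interval. Returns (pairs, players_left_waiting).
    """
    waiting = list(queue)
    pairs = []
    unmatched = []
    while waiting:
        player = waiting.pop(0)
        for i, candidate in enumerate(waiting):
            if candidate[1] != player[1]:
                pairs.append((player[0], candidate[0]))
                waiting.pop(i)
                break
        else:
            # No partner with a different IP was found in this interval.
            unmatched.append(player)
    return pairs, unmatched


queue = [
    ("a", "192.0.2.1"),
    ("b", "192.0.2.1"),  # same address as "a", so they must not be paired
    ("c", "198.51.100.7"),
    ("d", "203.0.113.9"),
]
print(pair_players(queue))  # ([('a', 'c'), ('b', 'd')], [])
```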

EVALUATION: USER STATISTICS

• Usage:
– 1 month in 2005
– 14,153 players
– 1,122,998 completed rounds
– Average person played around 158 images (or 72 minutes)

EVALUATION: ACCURACY OF DATA

• Accuracy of bounding boxes
– Choose 50 images played by at least two pairs
– Have four volunteers make bounding boxes
– OVERLAP(A,B) = AREA(A∩B) / AREA(A∪B)
– Average: 0.75
• Accuracy of pings
– 50 images as above
– Three subjects decide if ping is ‘inside the object’
– Result: 100%
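
The overlap measure for bounding boxes is the area of the intersection divided by the area of the union. A minimal sketch for axis-aligned boxes follows; the tuple representation `(x_min, y_min, x_max, y_max)` and the example coordinates are assumptions.

```python
def overlap(a, b):
    """OVERLAP(A, B) = AREA(A ∩ B) / AREA(A ∪ B) for axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples; this representation is
    an assumption of the sketch.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0


# Two hypothetical 10x10 boxes that share half their width.
print(round(overlap((0, 0, 10, 10), (5, 0, 15, 10)), 2))  # 0.33
```

A value of 1 means the two boxes coincide and 0 means they do not intersect, so the reported average of 0.75 indicates substantial but not perfect agreement.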

SOME GENERAL LESSONS

• von Ahn & Dabbish (2008) discuss the general approach and some lessons they took from their work

THREE TEMPLATES

• OUTPUT AGREEMENT GAMES
– Generalization of ESP
• INVERSION-PROBLEM GAMES
• INPUT-AGREEMENT GAMES

OUTPUT AGREEMENT GAMES

• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.

• In each round, both are given the same input
• Game instructions say that players should produce the same output as their partners
• Winning condition: they produce the same output, possibly after a few attempts

E.g.: ESP GAME.

INVERSION PROBLEM GAMES

• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.

• In each round, one player is designated as the DESCRIBER whereas the other is designated as the GUESSER. The output from the describer should help the guesser guess the original input

• WINNING CONDITION: The guesser correctly guesses the input originally assigned to the describer.

E.g.: VERBOSITY. Based on ‘20 Questions’.

INPUT AGREEMENT GAMES

• Two strangers are chosen among all potential players. They cannot see each other or communicate with each other.

• In each round, both are given input that is known by the game (but not by the players) to be the same or different

• Game instructions say that players should produce output describing their input so that they can decide whether input is same or different

• Winning condition: playing partners correctly decide whether input is same or different.

E.g.: TagATune.

INCREASE ENJOYMENT

• Games designed so as to make the task enjoyable

• GWAPs by von Ahn et al attempt to do this by giving players a CHALLENGE:
– TIMED RESPONSE
– SCORE KEEPING
– SKILL LEVELS
– HIGH SCORE LEVELS

OUTPUT ACCURACY

• Mechanisms to ensure correctness and avoid collusions (e.g., always produce the same label)
– Random matching (players don’t know each other’s identity)
– Player testing (assess quality of particular player’s input by matching his output against already annotated data)
– Repetition (output only considered correct if many players produced it)
– Taboo
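
Player testing can be sketched as a comparison against already annotated items. The function name, the data layout and the idea of returning a plain fraction are assumptions; the slides do not say how the comparison is scored.

```python
def player_accuracy(player_outputs, gold):
    """Fraction of a player's outputs on seed items that match known answers.

    player_outputs maps item id -> the player's output; gold maps item id ->
    the set of acceptable outputs taken from already annotated data.
    """
    tested = [item for item in player_outputs if item in gold]
    if not tested:
        return None  # the player has not seen any seed item yet
    correct = sum(player_outputs[item] in gold[item] for item in tested)
    return correct / len(tested)


# Hypothetical seed annotations and one player's outputs.
gold = {"img-1": {"dog", "puppy"}, "img-2": {"car"}}
outputs = {"img-1": "dog", "img-2": "tree", "img-3": "sky"}
print(player_accuracy(outputs, gold))  # 0.5
```

Items without a known answer (here `img-3`) are ignored, so the score reflects only the cases where the player's output can actually be checked.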

MISCELLANEOUS

• Other useful ideas
• Evaluation
– Efficiency: THROUGHPUT (T)
– ‘Enjoyability’: AVERAGE LIFETIME PLAY (ALP)
– Combined measure: EXPECTED CONTRIBUTION = T * ALP
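
A worked example of the combined measure, with made-up numbers. The figures below are hypothetical and are not taken from the slides; the units (problem instances per human-hour for T, hours for ALP) are likewise an assumption chosen so that the product is a count of problem instances per player.

```python
def expected_contribution(throughput_per_hour, alp_hours):
    """Expected contribution = throughput (T) * average lifetime play (ALP)."""
    return throughput_per_hour * alp_hours


# Hypothetical figures: 200 problem instances solved per human-hour,
# and 1.5 hours of play per player over their lifetime.
print(expected_contribution(200, 1.5))  # 300.0
```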

OTHER GAMES

• On gwap.com
– TagATune

• Elsewhere:
– FoldIt
– Karaoke Callout
– Phetch
– Spectral Game

FOLDIT

THE PROBLEM: PROTEIN FOLDING

Petsko G.A., Ringe, D., Protein Structure and Function 2004, figure 5-5, pg. 173.

REPRESENTING PROTEIN STRUCTURE

[Figure: wire diagram; ribbon diagram; ball & stick of featured area; space filling (van der Waals); surface representation (GRASP image; blue: positive, red: negative)]

THE GAME

INTRO: https://www.youtube.com/watch?v=bo99JjnfdA8

DETAILED EXAMPLE: https://www.youtube.com/watch?v=lGYJyur4FUA

EVALUATION

PROBLEMS SOLVED BY FOLDIT PLAYERS

GWAPs for NLP

• Lexical Resource Creation:
– (Verbosity)
– Jeux de Mots
– Groningen Meaning Bank

• Corpus annotation:
– The GIVE challenge
– Phatris
– Phrase Detectives (next lecture)
– The sentiment game

JEUX DE MOTS

• A game to acquire a ‘lexical-semantic network’: a knowledge base with information about
– Concepts
– Their lexical associations
– Their conceptual relations (ISA, PART-OF, etc)
• Developed by Mathieu Lafourcade
• Since 2007

BASICS

• A two-player game
• The players do not know each other (as in Verbosity etc)

ENTERING LEXICAL ASSOCIATIONS

Game play

[Diagram: target word + instructions, shown to player 1 and player 2; each enters propositions; diagram labels: intersection, accordance, reward]

SCORING

RESULTS OF A GAME

RESULTS SO FAR

• 1,375,432 games played since 2007
– Over 9 million relations entered

• Results of game(s): dictionary called DIKO

THE GIVE CHALLENGE

• Generating Instructions in Virtual Environments

• A shared task for the NLG community
• Users evaluate systems by playing a game in which the instructions are generated by NLG systems

REFERENCES

• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58–67.

• L. von Ahn and L. Dabbish (2004). Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 319–326.

• L. von Ahn, R. Liu, and M. Blum (2006). Peekaboom: a game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 55–64.

• www.gwap.com

• Luis von Ahn’s talk on Human Computation at Google talks
