Getting Semantics from the Crowd

Getting Semantics from the Crowd. Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland


DESCRIPTION

Talk given at the Dagstuhl seminar on Semantic Data Management, April 2012

TRANSCRIPT

Page 1: Getting Semantics from the Crowd

Getting Semantics from the Crowd

Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland

Page 2: Getting Semantics from the Crowd

Semantic Web 2.0

• not the Web 3.0
• Getting semantics from (non-expert) people
  – From few publishers and many consumers (SW 1.0)
  – To many publishers and many consumers (SW 2.0)

27-Apr-12, Gianluca Demartini, eXascale Infolab

Page 3: Getting Semantics from the Crowd

read/write SW

• Wikidata: http://meta.wikimedia.org/wiki/Wikidata

 

• Semantics is about the meaning
• Get people in the loop!
• Social computing for SemWeb applications


Page 4: Getting Semantics from the Crowd

Crowdsourcing  

• Exploit human intelligence to solve
  – Tasks simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)

• Examples
  – Wikipedia, Flickr

• Incentives
  – Financial, fun, visibility


Page 5: Getting Semantics from the Crowd

Crowdsourcing  

• Success Stories
  – Training set for ML
  – Image tagging
  – Document annotation/translation
  – IR evaluation [Blanco et al. SIGIR 2011]
  – CrowdDB [Franklin et al. SIGMOD 2011]


Page 6: Getting Semantics from the Crowd

Crowd-powered SW apps

• Entity Linking [ZenCrowd at WWW12]
• Create/validate sameAs links
• Schema matching

• ... Add your own favorite application!


[Slide figure: linking HTML+RDFa pages to the LOD Cloud]

Page 7: Getting Semantics from the Crowd

ZenCrowd  

• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework


[Slide figure: the Crowd alongside algorithms/machines]

Page 8: Getting Semantics from the Crowd

ZenCrowd Architecture

[Architecture diagram. Input: HTML pages feed ZenCrowd's Entity Extractors; Algorithmic Matchers query an LOD Index (Get Entity) over the LOD Open Data Cloud; Micro Matching Tasks go through the Micro-Task Manager to a Crowdsourcing Platform; the workers' decisions feed a Probabilistic Network and Decision Engine; output: HTML+RDFa pages.]

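The architecture on this slide can be read as a pipeline: extract entity mentions, try algorithmic matching against an LOD index, and escalate only the uncertain cases to the crowd. A minimal sketch of that control flow; all names, thresholds, and data below are illustrative assumptions, not an API from the talk:

```python
# Hypothetical sketch of a ZenCrowd-style pipeline: algorithmic matching
# first, crowdsourcing only for uncertain cases. Thresholds are illustrative.

def algorithmic_match(mention, lod_index):
    """Return the (candidate URI, confidence) pair with highest confidence."""
    candidates = lod_index.get(mention.lower(), [])
    if not candidates:
        return None, 0.0
    return max(candidates, key=lambda c: c[1])

def link_entities(mentions, lod_index, crowd, low=0.4, high=0.9):
    links = {}
    for m in mentions:
        uri, conf = algorithmic_match(m, lod_index)
        if conf >= high:          # confident: accept the machine link
            links[m] = uri
        elif conf > low:          # uncertain: send a micro-task to the crowd
            links[m] = crowd(m, uri)
        # else: no plausible candidate, drop the mention
    return links

lod_index = {
    "fribourg": [("dbpedia:Fribourg", 0.95)],
    "zencrowd": [("dbpedia:ZenCrowd", 0.6)],
}
crowd = lambda mention, uri: uri  # stand-in: the crowd confirms the candidate
print(link_entities(["Fribourg", "ZenCrowd", "Xyz"], lod_index, crowd))
# → {'Fribourg': 'dbpedia:Fribourg', 'ZenCrowd': 'dbpedia:ZenCrowd'}
```

The design point is the middle band: only mentions the matcher is unsure about cost crowd money, which is what makes combining the two economical.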

Page 9: Getting Semantics from the Crowd

The micro-task


Page 10: Getting Semantics from the Crowd

Entity Factor Graphs

• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link Factors
  – Constraints

• Probabilistic Inference
  – Select all links with posterior prob > τ

[Figure: entity factor graph for 2 workers, 6 clicks, and 3 candidate links. Variables: worker priors pw1, pw2; link priors pl1–pl3; observed click variables c11–c23; link factors lf1–lf3; a SameAs constraint sa1-2; a dataset unicity constraint u2-3.]

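The inference step on this slide can be approximated with a much simpler independent-worker model: each link has a prior, each worker a reliability, and a link is accepted when its posterior exceeds the threshold τ. A minimal sketch under that assumption; the reliabilities, priors, and vote data are invented for illustration, and the real ZenCrowd factor graph additionally handles SameAs and unicity constraints:

```python
# Naive-Bayes approximation of posterior link selection, assuming workers
# vote independently. Values are illustrative, not from the talk.

def link_posterior(prior, votes, reliabilities):
    """votes[i] is True if worker i clicked 'valid'; reliabilities[i] is the
    probability that worker i answers correctly."""
    p_yes, p_no = prior, 1.0 - prior
    for vote, r in zip(votes, reliabilities):
        p_yes *= r if vote else (1.0 - r)
        p_no *= (1.0 - r) if vote else r
    return p_yes / (p_yes + p_no)

def select_links(candidates, tau=0.5):
    """Keep only links whose posterior probability exceeds tau."""
    return [name for name, (prior, votes, rels) in candidates.items()
            if link_posterior(prior, votes, rels) > tau]

candidates = {
    "l1": (0.5, [True, True], [0.9, 0.8]),    # both workers say valid
    "l2": (0.5, [True, False], [0.9, 0.8]),   # workers disagree
    "l3": (0.5, [False, False], [0.9, 0.8]),  # both workers say invalid
}
print(select_links(candidates, tau=0.7))  # → ['l1']
```

With τ = 0.7 only the unanimously confirmed link survives; the disagreement case lands at roughly 0.69 because the more reliable worker voted yes, which is exactly the kind of tie the dynamic worker assessment is meant to break.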

Page 11: Getting Semantics from the Crowd

ZenCrowd: Lessons Learnt

• Crowdsourcing + Prob reasoning works!
• But
  – Different worker communities perform differently
  – No differences w/ different contexts
  – Completion time may vary (based on reward)
  – Many low quality workers + Spam


Page 12: Getting Semantics from the Crowd

ZenCrowd  

• Worker Selection

[Plot: worker precision (0–1, y-axis) vs. number of tasks (0–500, x-axis) for US workers and IN workers, with the top US worker highlighted.]

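The worker-selection idea behind this plot can be sketched as tracking each worker's precision on tasks with known answers and keeping only workers above a cutoff. The gold labels, answer histories, and the 0.8 cutoff below are illustrative assumptions, not ZenCrowd's actual parameters:

```python
# Sketch of dynamic worker assessment: precision measured against gold
# (known-answer) tasks. Data and the 0.8 cutoff are illustrative.

def worker_precision(answers, gold):
    """Fraction of a worker's answers that match the gold labels."""
    correct = sum(1 for task, a in answers.items() if gold.get(task) == a)
    return correct / len(answers) if answers else 0.0

def select_workers(history, gold, min_precision=0.8):
    """Keep workers whose running precision meets the cutoff."""
    return sorted(w for w, answers in history.items()
                  if worker_precision(answers, gold) >= min_precision)

gold = {"t1": "valid", "t2": "invalid", "t3": "valid"}
history = {
    "worker_a": {"t1": "valid", "t2": "invalid", "t3": "valid"},   # 3/3 correct
    "worker_b": {"t1": "valid", "t2": "valid", "t3": "invalid"},   # 1/3 correct
}
print(select_workers(history, gold))  # → ['worker_a']
```

In practice this runs continuously: as the plot shows, precision drifts with the number of tasks completed, so reliabilities have to be re-estimated rather than fixed once.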

Page 13: Getting Semantics from the Crowd

Challenges for Crowd-SW

• How to design the micro-task
• Where to find the crowd
  – MTurk, Facebook (900M users)

• Evaluation
  – Which ground truth?!

• Quality control / Spam
  – Need for spam benchmarks in Crowdsourcing [Mechanical Cheat at CrowdSearch 2012]


Page 14: Getting Semantics from the Crowd
