Getting Semantics from the Crowd

DESCRIPTION
Talk given at the Dagstuhl seminar on Semantic Data Management, April 2012

TRANSCRIPT
Getting Semantics from the Crowd
Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland
Semantic Web 2.0
• Not the Web 3.0
• Getting semantics from (non-expert) people
  – From few publishers and many consumers (SW 1.0)
  – To many publishers and many consumers (SW 2.0)
27-Apr-12 Gianluca Demartini, eXascale Infolab
read/write SW
• Wikidata: http://meta.wikimedia.org/wiki/Wikidata
• Semantics is about the meaning
• Get people in the loop!
• Social computing for SemWeb applications
Crowdsourcing
• Exploit human intelligence to solve
  – Tasks simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, Flickr
• Incentives
  – Financial, fun, visibility
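Since individual answers are noisy, micro-task platforms typically assign each task to several workers and aggregate the redundant answers. A minimal majority-vote sketch (the task and labels are hypothetical, not from the talk):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate redundant worker answers for one micro-task
    by keeping the most frequent label."""
    (label, _count), = Counter(answers).most_common(1)
    return label

# Hypothetical image-tagging micro-task answered by three workers:
majority_vote(["cat", "cat", "dog"])  # -> "cat"
```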
Crowdsourcing
• Success stories
  – Training sets for ML
  – Image tagging
  – Document annotation/translation
  – IR evaluation [Blanco et al. SIGIR 2011]
  – CrowdDB [Franklin et al. SIGMOD 2011]
Crowd-powered SW apps
• Entity Linking [ZenCrowd at WWW12]
• Create/validate sameAs links
• Schema matching
• ... Add your own favorite application!
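For the sameAs use case above, a crowd-validated link can be materialized as an owl:sameAs statement. A minimal sketch emitting one N-Triples line (the entity URIs in the example are hypothetical):

```python
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def sameas_triple(subject_uri, object_uri):
    """Serialize a crowd-validated equivalence link as an N-Triples line."""
    return f"<{subject_uri}> <{OWL_SAMEAS}> <{object_uri}> ."

# Hypothetical link between entities of two datasets:
print(sameas_triple("http://dbpedia.org/resource/Fribourg",
                    "http://example.org/places/Fribourg"))
```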
[Figure: HTML+RDFa pages linked to the LOD Cloud]
ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework
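One way to combine algorithmic and manual linking is to crowdsource only the candidate links the algorithmic matchers are unsure about. A sketch of such routing (thresholds and names are hypothetical, not ZenCrowd's actual configuration):

```python
def route_links(scored_links, low=0.3, high=0.8):
    """Accept high-confidence algorithmic matches, reject low-confidence
    ones, and send the uncertain middle band to crowd workers."""
    accepted, rejected, to_crowd = [], [], []
    for link, score in scored_links:
        if score >= high:
            accepted.append(link)
        elif score <= low:
            rejected.append(link)
        else:
            to_crowd.append(link)
    return accepted, rejected, to_crowd

# Hypothetical matcher scores for three candidate links:
route_links([("l1", 0.92), ("l2", 0.55), ("l3", 0.10)])
# -> (["l1"], ["l3"], ["l2"])
```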
ZenCrowd Architecture
[Architecture diagram, spanning a Crowd side and an Algorithms/Machines side: HTML and HTML+RDFa pages (input) feed ZenCrowd entity extractors; an LOD index ("Get Entity") backs algorithmic matchers, while a micro-task manager dispatches micro matching tasks to workers on a crowdsourcing platform; worker decisions and matcher results meet in a probabilistic network and decision engine, producing output links into the LOD Open Data Cloud.]
The micro-task
Entity Factor Graphs
• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link factors
  – Constraints
• Probabilistic inference
  – Select all links with posterior probability > τ
[Factor graph figure: 2 workers, 6 clicks, 3 candidate links. Worker variables w1, w2 with worker priors pw(·); link variables l1–l3 with link priors pl(·) and link factors lf(·); observed click variables c11–c23; sameAs constraints sa(·); dataset unicity constraints u(·).]
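The inference step can be illustrated with a heavily simplified model: drop the sameAs and unicity constraints and treat each worker's clicks as independent given the true link label, with a per-worker reliability p_w. This is only a sketch of the idea, not ZenCrowd's actual factor graph; all names and numbers below are hypothetical:

```python
def link_posterior(prior, votes, reliability):
    """Posterior probability that a candidate link is correct.
    votes: {worker: True/False} click per worker;
    reliability: {worker: p_w}, probability of clicking the true label."""
    p_match, p_nomatch = prior, 1.0 - prior
    for worker, vote in votes.items():
        p_w = reliability[worker]
        p_match *= p_w if vote else (1.0 - p_w)
        p_nomatch *= (1.0 - p_w) if vote else p_w
    return p_match / (p_match + p_nomatch)

def select_links(candidates, reliability, tau=0.8):
    """Keep every link whose posterior exceeds the threshold tau."""
    return [link for link, (prior, votes) in candidates.items()
            if link_posterior(prior, votes, reliability) > tau]

# Two workers (one reliable, one near-random) judge three links:
reliability = {"w1": 0.95, "w2": 0.60}
candidates = {
    "l1": (0.5, {"w1": True,  "w2": True}),
    "l2": (0.5, {"w1": False, "w2": True}),
    "l3": (0.5, {"w1": True,  "w2": False}),
}
select_links(candidates, reliability)  # -> ["l1", "l3"]
```

The full model additionally couples links through the sameAs and unicity factors shown in the figure, so links are not scored independently as they are here.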
ZenCrowd: Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But:
  – Different worker communities perform differently
  – No differences with different contexts
  – Completion time may vary (based on reward)
  – Many low-quality workers + spam
ZenCrowd
• Worker Selection

[Plot: worker precision (0 to 1) vs. number of tasks (0 to 500) for US workers and IN workers; the top US worker is highlighted.]
Challenges for Crowd-SW
• How to design the micro-task
• Where to find the crowd
  – MTurk, Facebook (900M users)
• Evaluation
  – Which ground truth?!
• Quality control / spam
  – Need for spam benchmarks in crowdsourcing [Mechanical Cheat at CrowdSearch 2012]