Query Optimization over Crowdsourced Data

Post on 09-Jul-2015

DESCRIPTION

Presented at VLDB 2013.

TRANSCRIPT

  • Query Optimization over Crowdsourced Data

    Hyunjung Park, Jennifer Widom
    Stanford University

  • Deco: Declarative Crowdsourcing

    Query: Find the capitals of eight Spanish-speaking countries

    Questions for the crowd:
      Give me a Spanish-speaking country.
      Give me a country.
      What language do they speak in country X?
      What is the capital of country X?

    [Figure: the same table appears under the "DBMS" and "Deco System" labels]

    country   language   capital
    Italy     Italian    Rome
    Spain     Spanish    Madrid

    8/27/2013 Hyunjung Park 2

  • Deco Query Optimization

    Crowd incurs monetary cost
    Some query plans are much cheaper than others
    Cost estimation is complicated by:
      Previously collected data
      Unknown database state
      Inconsistency of human answers

  • Outline

    Motivating example
    Deco data model and queries
    Cost and cardinality estimation
    Experimental results

    Everything implemented in full prototype

  • Motivating Example: Plan 1

    Query: Find the capitals of eight Spanish-speaking countries

    [Figure: Plan 1 flowchart. Crowd questions: "Give me a country.", "What language do they speak in country X?", "What is the capital of country X?". Decision nodes "unseen" and "Spanish", each with T/F branches. Loop marked "8x".]

  • Motivating Example: Plan 2

    Query: Find the capitals of eight Spanish-speaking countries

    [Figure: Plan 2 flowchart. Crowd questions: "Give me a Spanish-speaking country.", "What language do they speak in country X?", "What is the capital of country X?". Decision nodes "unseen" and "Spanish", each with T/F branches. Loop marked "8x".]

  • Preview of Experimental Results

    [Figure: bar chart of actual costs spent on Mechanical Turk ($), axis from 0 to 15, for Plan 1 and Plan 2. Legend: "What is the capital of country X?", "What language do they speak in country X?", "Give me a Spanish-speaking country.", "Give me a country."]

  • Deco: Data Model (1/2)

    Conceptual Relation: visible to end-users
      Country (country, language, capital)

    Resolution Rules: cleanse raw data using UDFs
      country: dupElim
      language: majority(3)
      capital: majority(3)
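
    The resolution rules can be pictured as small functions over raw crowd answers. The sketch below is only an illustration: the function names are hypothetical, exact string matching stands in for whatever duplicate detection the real UDF performs, and majority(3) is assumed to resolve as soon as two of up to three answers agree.

```python
from collections import Counter
from typing import List, Optional


def dup_elim(raw_values: List[str]) -> List[str]:
    """Keep the first occurrence of each distinct value, preserving order."""
    seen = set()
    resolved = []
    for value in raw_values:
        if value not in seen:
            seen.add(value)
            resolved.append(value)
    return resolved


def majority(raw_values: List[str], n: int = 3) -> Optional[str]:
    """Return the value given by a strict majority of n answers, else None.

    Assumption (not stated on the slide): with n=3, two matching answers
    are enough, and fewer than two matching answers resolve to nothing.
    """
    counts = Counter(raw_values[:n])
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes > n // 2 else None


countries = dup_elim(["Spain", "Peru", "Spain", "Chile"])
capital = majority(["Madrid", "Barcelona", "Madrid"])
undecided = majority(["Madrid", "Barcelona"])
```

    Here `countries` keeps three distinct names, `capital` resolves to "Madrid", and `undecided` stays unresolved until a third answer arrives.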

  • Deco: Data Model (2/2)

    Fetch Rules: access methods for the crowd

      language => country    Give me a {language}-speaking country.
      => country             Give me a country.
      country => language    What language do they speak in {country}?
      country => capital     What is the capital of {country}?

    Price labels: [$0.05] [$0.01] [$0.02] [$0.03]
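
    A fetch rule pairs the attributes handed to a worker with the attributes the worker fills in, plus a question template. A minimal sketch of that structure follows; the `FetchRule` class and its field names are hypothetical and are not Deco's actual interface. Prices are left out on purpose.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FetchRule:
    given: Tuple[str, ...]    # attributes supplied to the worker (left of =>)
    fetched: Tuple[str, ...]  # attributes the worker fills in (right of =>)
    template: str             # question text with {attribute} placeholders

    def question(self, bindings: Dict[str, str]) -> str:
        """Fill the template with the values of the given attributes."""
        missing = [a for a in self.given if a not in bindings]
        if missing:
            raise KeyError(f"missing values for: {missing}")
        return self.template.format(**{a: bindings[a] for a in self.given})


RULES = [
    FetchRule(("language",), ("country",), "Give me a {language}-speaking country."),
    FetchRule((), ("country",), "Give me a country."),
    FetchRule(("country",), ("language",), "What language do they speak in {country}?"),
    FetchRule(("country",), ("capital",), "What is the capital of {country}?"),
]

spanish_question = RULES[0].question({"language": "Spanish"})
capital_question = RULES[3].question({"country": "Spain"})
```

    A rule with an empty left-hand side, such as `=> country`, needs no bindings and always produces the same question.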

  • Deco: Queries

    Deco query: SQL query over conceptual relations

      SELECT country, capital
      FROM Country
      WHERE language='Spanish'
      MINTUPLES 8

    Query processor: access the crowd as needed to produce query result while:
      1. Minimizing monetary cost (query optimizer)
      2. Reducing latency (query execution engine)
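
    The effect of MINTUPLES can be pictured as a loop that keeps asking for data only until enough qualifying tuples exist. The sketch below is a deliberate simplification with hypothetical names: `next_row` stands in for whatever combination of stored data and crowd answers the real system uses, and it ignores resolution, cost, and parallelism.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def run_until_min_tuples(
    next_row: Callable[[], Optional[Row]],
    predicate: Callable[[Row], bool],
    min_tuples: int,
) -> List[Row]:
    """Pull rows until min_tuples of them satisfy the predicate.

    Stops early if next_row() returns None, meaning no more data
    can be obtained.
    """
    results: List[Row] = []
    while len(results) < min_tuples:
        row = next_row()
        if row is None:
            break
        if predicate(row):
            results.append(row)
    return results


# Canned answers standing in for the crowd, using the rows from the slides.
canned = iter([
    {"country": "Italy", "language": "Italian", "capital": "Rome"},
    {"country": "Spain", "language": "Spanish", "capital": "Madrid"},
])
result = run_until_min_tuples(
    lambda: next(canned, None),
    lambda row: row["language"] == "Spanish",
    min_tuples=1,
)
```

    With `min_tuples=1` the loop stops right after the Spain row, so no further questions would be asked.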

  • Query Optimization

    Find the best query plan in terms of estimated monetary cost

    As in a traditional query optimizer:
      1. Cost and cardinality estimation
      2. Search space
      3. Plan enumeration algorithm

  • Cost Estimation

    $\text{Total monetary cost} = \sum_{\text{Fetch } F} F.\text{price} \times F.\text{cardinality}$
      Existing data is free

    Definition of cardinality in Deco:
      Total number of expected output tuples from operator until query execution terminates

    Cardinality estimation:
      Final database state needs to be estimated simultaneously
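
    The cost formula sums price times cardinality over the Fetch operators, where a fetch's cardinality counts only new crowd answers. A small sketch, assuming hypothetical names (`FetchEstimate`, `new_answers`) and illustrative numbers that are not taken from the slides; amounts are in cents to avoid floating-point rounding.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FetchEstimate:
    name: str
    price_cents: int   # price of one crowd answer for this fetch rule
    cardinality: int   # expected number of NEW answers until termination


def new_answers(needed: int, existing: int) -> int:
    """Existing data is free: only answers beyond those already stored count."""
    return max(0, needed - existing)


def total_cost_cents(fetches: Iterable[FetchEstimate]) -> int:
    """Sum of price * cardinality over all Fetch operators in a plan."""
    return sum(f.price_cents * f.cardinality for f in fetches)


# Illustrative plan: 10 answers needed with none stored, 9 needed with 5 stored.
plan = [
    FetchEstimate("fetch_a", price_cents=5, cardinality=new_answers(10, 0)),
    FetchEstimate("fetch_b", price_cents=2, cardinality=new_answers(9, 5)),
]
cost = total_cost_cents(plan)
```

    The hard part, as the slide notes, is estimating each cardinality, since it depends on a final database state that is not known in advance.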

  • Cardinality Estimation: Setting

    $0.05 for all fetch rules
    No existing data
    Selectivity factors:
      language=Spanish: 0.1
      dupElim: 0.8
      majority(3): 0.4 (=1/2.5)

  • Cardinality Estimation: Plan 1

    SELECT country, capital
    FROM Country
    WHERE language='Spanish'
    MINTUPLES 8

    [Figure: query plan tree with operators numbered 1 to 14: MinTuples[8], Project[co,ca], two DLOJoin[co], Filter[la=Spanish], Resolve[dupeli], two Resolve[maj3], Scan[CtryA] with Fetch[co], Scan[CtryD1] with Fetch[cola], Scan[CtryD2] with Fetch[coca]]

    Estimated fetch cardinalities:
      Fetch[co]     => country             100
      Fetch[cola]   country => language    200
      Fetch[coca]   country => capital      20

    Cost estimation: $0.05(100+200+20) = $16.00
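
    One way to arrive at the Plan 1 cardinalities is to work backward from MinTuples[8], dividing the number of tuples an operator must output by its selectivity factor. The sketch below follows that reading; it is an interpretation that reproduces the slide's numbers, not the estimation algorithm from the paper, and all names are hypothetical.

```python
from fractions import Fraction

# Setting from the slides (exact fractions avoid floating-point noise).
PRICE = Fraction("0.05")        # dollars per fetch, same for all rules
SEL_SPANISH = Fraction("0.1")   # Filter[la=Spanish]
SEL_DUPELIM = Fraction("0.8")   # Resolve[dupeli]
SEL_MAJ3 = Fraction("0.4")      # Resolve[maj3], i.e. 2.5 answers per value
MIN_TUPLES = 8


def inputs_needed(outputs: Fraction, selectivity: Fraction) -> Fraction:
    """Expected input tuples required to produce the given output tuples."""
    return outputs / selectivity


# 8 Spanish-speaking countries require 80 resolved countries to pass the filter.
resolved_countries = inputs_needed(Fraction(MIN_TUPLES), SEL_SPANISH)
# Each resolved country needs raw "Give me a country." answers before dupElim.
country_fetches = inputs_needed(resolved_countries, SEL_DUPELIM)
# Every resolved country needs its language resolved before it can be filtered.
language_fetches = inputs_needed(resolved_countries, SEL_MAJ3)
# Only the 8 surviving countries need a resolved capital.
capital_fetches = inputs_needed(Fraction(MIN_TUPLES), SEL_MAJ3)

plan1_cost = PRICE * (country_fetches + language_fetches + capital_fetches)
```

    Most of the estimated cost comes from resolving the language of countries that the filter later discards.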

  • Cardinality Estimation: Plan 2

    SELECT country, capital
    FROM Country
    WHERE language='Spanish'
    MINTUPLES 8

    [Figure: query plan tree with operators numbered 1 to 14, with 8a in place of 8: MinTuples[8], Project[co,ca], two DLOJoin[co], Filter[la=Spanish], Resolve[dupeli], two Resolve[maj3], Scan[CtryA] with Fetch[laco], Scan[CtryD1] with Fetch[cola], Scan[CtryD2] with Fetch[coca]]

    Estimated fetch cardinalities:
      Fetch[laco]   language => country    10
      Fetch[cola]   country => language    20
      Fetch[coca]   country => capital     20

    Cost estimation: $0.05(10+20+20) = $2.50
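
    The Plan 2 numbers can be reproduced by the same backward reasoning from MinTuples[8], under the assumption (an interpretation, not stated on the slide) that countries fetched with "Give me a Spanish-speaking country." all pass the language filter once their language is resolved:

```latex
\begin{align*}
\text{Fetch[laco]} &= 8 / 0.8 = 10  && \text{(raw countries before dupElim)} \\
\text{Fetch[cola]} &= 8 / 0.4 = 20  && \text{(language answers before majority(3))} \\
\text{Fetch[coca]} &= 8 / 0.4 = 20  && \text{(capital answers before majority(3))} \\
\text{Cost}        &= \$0.05 \times (10 + 20 + 20) = \$2.50
\end{align*}
```

    Against the $16.00 estimate for Plan 1, this makes Plan 2 an estimated 6.4 times cheaper ($16.00 / $2.50).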

  • Experimental Results

    [Figure: two bar charts of actual cost ($), one for Plan 1 (axis from 0 to 15) and one for Plan 2 (axis from 0 to 3). Legend: country => capital, country => language, language => country, => country]

  • Experimental Results

    [Figure: two bar charts comparing actual and estimated cost ($), one for Plan 1 (axis from 0 to 15) and one for Plan 2 (axis from 0 to 3). Legend: country => capital, country => language, language => country, => country]

  • Related Work

    Declarative approach for crowdsourcing:
      Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...

    Crowd-powered algorithms/operations:
      Filter, sort, join, max, entity resolution, ...

    Also:
      Traditional query optimization
      Heterogeneous or federated database systems

  • Summary

    Cost estimation in Deco:
      Distinguish between existing data vs. new data
      Estimate cardinality and final database state simultaneously

    In the paper:
      Full description of cost estimation and plan enumeration algorithms
      More experimental results

  • Thank you!