CrowdDB: Answering Queries with Crowdsourcing


Page 1: Crowd DB : Answering Queries with Crowdsourcing

CrowdDB: Answering Queries with Crowdsourcing

Presenters: Mojtaba Bolboli, Ahmad Rahmani, Mojtaba Sadeghi
Instructor: Dr. Sheikh Esmaili
Course: Advanced Databases, Spring 1391 (2012)

Islamic Azad University, Science and Research Branch (SRB), Kurdistan

Page 2: Crowd DB : Answering Queries with Crowdsourcing

Relational Databases

Relational database systems have achieved widespread adoption:

◦ in the business environments for which they were originally envisioned

◦ for many types of structured data, including personal, social, and even scientific information

As data creation and use become increasingly democratized through web, mobile, and other technologies, the limitations of the technology are becoming more apparent.

Page 3: Crowd DB : Answering Queries with Crowdsourcing

RDBMS

[Diagram: an RDBMS assumes correctness, unambiguity, and completeness of its data.]

Page 4: Crowd DB : Answering Queries with Crowdsourcing

RDBMS

[Diagram: when correctness, unambiguity, or completeness does not hold, an RDBMS returns no, incomplete, or incorrect answers.]

Page 5: Crowd DB : Answering Queries with Crowdsourcing

Power to the People

One obvious situation where existing systems produce wrong answers is when they are missing information required for answering the question.

Page 6: Crowd DB : Answering Queries with Crowdsourcing

Consider the query:

SELECT market_capitalization
FROM company
WHERE name = "I.B.M.";

It returns an empty answer if the company table instance in the database at that time does not contain a record for “I.B.M.”

Reasons:

◦ A data entry mistake may have omitted the I.B.M. record, or the record may have been inadvertently deleted.

◦ The record was entered incorrectly, e.g., as “I.B.N.”

◦ Traditional systems can erroneously return an empty result even when no errors were made, e.g., when the record says “International Business Machines” instead of “I.B.M.”

Page 7: Crowd DB : Answering Queries with Crowdsourcing

Fundamental Problems

Relational database systems are based on the Closed World Assumption: information that is not in the database is considered false or non-existent.

Relational databases are also extremely literal:

◦ They expect that data has been properly cleaned and validated before entry, and do not natively tolerate inconsistencies in data or queries.

Page 8: Crowd DB : Answering Queries with Crowdsourcing

Consider a query to find the best among a collection of images to use in a motivational slide show:

SELECT image
FROM picture
WHERE topic = "Business Success"
ORDER BY relevance LIMIT 1;

Unless the relevance of pictures to specific topics has been previously obtained and stored, there is simply no good way to ask this question of a standard RDBMS.

Page 9: Crowd DB : Answering Queries with Crowdsourcing

SELECT market_capitalization
FROM company
WHERE name = "I.B.M.";

SELECT image
FROM picture
WHERE topic = "Business Success"
ORDER BY relevance LIMIT 1;

The queries above, while unanswerable by today’s relational database systems, could easily be answered by people, especially people who have access to the Internet.

Page 10: Crowd DB : Answering Queries with Crowdsourcing

CrowdDB

Recently, there has been substantial interest in harnessing human computation to solve problems that are impossible or too expensive to answer correctly using computers.

The approach taken in CrowdDB is to exploit the extensibility of the iterator-based query processing paradigm to add crowd functionality to a DBMS.

CrowdDB extends a traditional query engine with a small number of operators that solicit human input by generating and submitting work requests to a microtask crowdsourcing platform.

Page 11: Crowd DB : Answering Queries with Crowdsourcing

Capabilities of Human Computation vs. Traditional Query Processing

Finding new data:

◦ Recall that a key limitation of relational technology stems from the Closed World Assumption. People, on the other hand, aided by tools such as search engines and reference sources, are quite capable of finding information that they do not have readily at hand.

Comparing data:

◦ People are skilled at making comparisons that are difficult or impossible to encode in a computer algorithm. For example, it is easy for a person to tell, given the right context, that “I.B.M.” and “International Business Machines” are names for the same entity. Likewise, people can easily compare items such as images along many dimensions, such as how well an image represents a particular concept.

Page 12: Crowd DB : Answering Queries with Crowdsourcing

Benefits:

◦ By using SQL, CrowdDB creates a declarative interface to the crowd.

◦ CrowdDB provides physical data independence for the crowd.

◦ User interface design is a key factor in enabling questions to be answered by people. CrowdDB leverages schema information to automatically generate effective user interfaces for crowdsourced tasks.

◦ Because of its declarative programming environment and operator-based approach, CrowdDB presents the opportunity to implement cost-based optimizations that improve query cost, time, and accuracy.

Page 13: Crowd DB : Answering Queries with Crowdsourcing

Crowdsourcing

Crowdsourcing is a distributed problem-solving and production model. A crowdsourcing platform creates a marketplace on which requesters offer tasks and workers accept and work on them.

Page 14: Crowd DB : Answering Queries with Crowdsourcing

Amazon Mechanical Turk

Amazon Mechanical Turk (AMT) is a crowdsourcing Internet marketplace that enables computer programmers (known as requesters) to coordinate the use of human intelligence to perform tasks that computers are not yet able to do.

AMT provides an API for requesting and managing work, which makes it possible to connect it directly to the CrowdDB query processor.

In AMT, as part of specifying a task, a requester defines a price/reward (minimum $0.01) that the worker receives if the task is completed satisfactorily.

Workers from anywhere in the world can participate, but requesters must be holders of a US credit card.

Page 15: Crowd DB : Answering Queries with Crowdsourcing

Mechanical Turk Basics

HIT:

◦ A Human Intelligence Task (HIT) is the smallest entity of work a worker can accept to do.

Assignment:

◦ Every HIT can be replicated into multiple assignments.

HIT Group:

◦ AMT automatically groups similar HITs into HIT Groups based on the requester, the title of the HIT, the description, and the reward.

Page 16: Crowd DB : Answering Queries with Crowdsourcing

Workflow:

[Diagram: a requester posts HIT Groups to AMT; AMT groups HITs, each replicated into assignments; workers accept and complete the assignments.]

Page 17: Crowd DB : Answering Queries with Crowdsourcing

[Diagram: a worker interacts with two web interfaces — the main AMT interface and the task interface provided by the requester.]

Page 18: Crowd DB : Answering Queries with Crowdsourcing

Mechanical Turk API

createHIT(title, description, question, keywords, reward, duration, maxAssignments, lifetime) → HitID

Creates a new HIT and returns its identifier.
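For illustration, a minimal sketch of how a createHIT-style call maps onto today's AMT API via the boto3 Python SDK. The sandbox endpoint, reward, and question content are assumptions for the example, not part of the original deck:

# Minimal sketch: posting a HIT through boto3's MTurk client.
# Endpoint, reward, and question XML below are illustrative assumptions.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A trivial free-text question in AMT's QuestionForm XML schema.
QUESTION_XML = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>url</QuestionIdentifier>
    <QuestionContent><Text>What is the homepage URL of the EECS department at UC Berkeley?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

response = mturk.create_hit(
    Title="Find a department homepage",
    Description="Fill in a missing URL for a university department",
    Keywords="data collection, lookup",
    Question=QUESTION_XML,
    Reward="0.01",                    # minimum reward, as a string in USD
    AssignmentDurationInSeconds=300,  # 'duration' in the slide's signature
    MaxAssignments=5,                 # replicate the HIT into 5 assignments
    LifetimeInSeconds=3600,           # 'lifetime' in the slide's signature
)
hit_id = response["HIT"]["HITId"]     # the HitID returned to the caller
print("Created HIT:", hit_id)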

Page 19: Crowd DB : Answering Queries with Crowdsourcing

Mechanical Turk API

getAssignmentsForHIT(HitID) → list(asnId, workerId, answer)

Retrieves the assignments (worker answers) submitted for a HIT.

Page 20: Crowd DB : Answering Queries with Crowdsourcing

Mechanical Turk API

approveAssignment(asnID): the worker receives the reward and Amazon receives its commission.

rejectAssignment(asnID): the worker receives no reward, but Amazon still receives its commission.

Page 21: Crowd DB : Answering Queries with Crowdsourcing

Mechanical Turk API

forceExpireHIT(HitID)

Expires a HIT immediately.
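Continuing the boto3 sketch above, the remaining three calls map onto list_assignments_for_hit, approve_assignment / reject_assignment, and update_expiration_for_hit (in today's API, a HIT is force-expired by setting its expiration to the current time). The variables mturk and hit_id come from the earlier sketch:

# Sketch continued: retrieving answers and settling assignments with boto3.
from datetime import datetime, timezone

# getAssignmentsForHIT(HitID) -> list(asnId, workerId, answer)
result = mturk.list_assignments_for_hit(
    HITId=hit_id,
    AssignmentStatuses=["Submitted"],
)
for asn in result["Assignments"]:
    print(asn["AssignmentId"], asn["WorkerId"], asn["Answer"])  # Answer is XML

    # approveAssignment / rejectAssignment: pay (or refuse) the worker.
    # Amazon's commission is charged in both cases.
    mturk.approve_assignment(AssignmentId=asn["AssignmentId"])
    # mturk.reject_assignment(AssignmentId=asn["AssignmentId"],
    #                         RequesterFeedback="Answer did not match the task")

# forceExpireHIT(HitID): setting the expiration to "now" expires it immediately.
mturk.update_expiration_for_hit(HITId=hit_id, ExpireAt=datetime.now(timezone.utc))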

Page 22: Crowd DB : Answering Queries with Crowdsourcing

The Crowd

The crowd can be seen as a set of specialized processors: humans are good at certain tasks (e.g., image recognition), while machines are good at others (e.g., working with numbers).

[Diagram: human, machine, and crowd as kinds of processors.]

Page 23: Crowd DB : Answering Queries with Crowdsourcing

Key Issues That Have Influenced the Design of CrowdDB

◦ Performance and variability

◦ Task design and ambiguity

◦ Affinity and learning

◦ Relatively small worker pool

◦ Open vs. closed world

Page 24: Crowd DB : Answering Queries with Crowdsourcing

CrowdDB

[Diagram: CrowdDB architecture.]

Page 25: Crowd DB : Answering Queries with Crowdsourcing

New Components

◦ Turker Relationship Management

◦ User Interface Management

◦ HIT Manager

Page 26: Crowd DB : Answering Queries with Crowdsourcing

CrowdSQL

CrowdSQL is a SQL extension that supports crowdsourcing. Its extensions involve missing data and subjective comparisons.

Developers write SQL code in the same way as they do for traditional databases; ideally, they are not even aware that their code involves crowdsourcing.

Page 27: Crowd DB : Answering Queries with Crowdsourcing

SQL DDL Extensions: Incomplete Data

◦ Specific attributes of tuples can be crowdsourced.

◦ Entire tuples can be crowdsourced.

◦ Both are marked with a special keyword, CROWD.

Page 28: Crowd DB : Answering Queries with Crowdsourcing

Crowdsourced Column

In the following Department table, the url column is marked as crowdsourced.

CREATE TABLE Department (
  university STRING,
  name STRING,
  url CROWD STRING,
  phone STRING,
  PRIMARY KEY (university, name)
);

Page 29: Crowd DB : Answering Queries with Crowdsourcing

Crowdsourced Table

This example models a Professor table as a crowdsourced table.

CREATE CROWD TABLE Professor (
  name STRING PRIMARY KEY,
  email STRING UNIQUE,
  university STRING,
  department STRING,
  FOREIGN KEY (university, department)
    REF Department(university, name)
);

Page 30: Crowd DB : Answering Queries with Crowdsourcing

◦ CrowdDB allows any column and any table to be marked with the CROWD keyword.

◦ CrowdDB does not impose any limitations with regard to SQL types and integrity constraints.

◦ CROWD tables must have a primary key.

Page 31: Crowd DB : Answering Queries with Crowdsourcing

DML (Data Manipulation Language):

◦ SELECT, UPDATE, DELETE, INSERT INTO

DDL (Data Definition Language):

◦ CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, DROP INDEX

DCL (Data Control Language):

◦ GRANT, REVOKE

Page 32: Crowd DB : Answering Queries with Crowdsourcing

SQL DML Semantics

To represent values of crowdsourced columns that are not yet known, CrowdSQL uses a special value (CNULL in the original paper). The following INSERT leaves the crowdsourced url column to be filled in by the crowd:

INSERT INTO Department(university, name)
VALUES ("UC Berkeley", "EECS");

It is also possible to specify the value of a CROWD column as part of an INSERT statement:

INSERT INTO Department(university, name, url)
VALUES ("ETH Zurich", "CS", "inf.ethz.ch");

Page 33: Crowd DB : Answering Queries with Crowdsourcing

Query Semantics

◦ CrowdDB supports any kind of SQL query on CROWD tables and columns.

◦ Joins and sub-selects between two CROWD tables are allowed.

Page 34: Crowd DB : Answering Queries with Crowdsourcing

Subjective Comparisons

CrowdSQL adds two new built-in functions: CROWDEQUAL and CROWDORDER.

CROWDEQUAL takes two parameters (an lvalue and an rvalue) and asks the crowd to decide whether they refer to the same entity. For example, the earlier I.B.M. query can be crowdsourced as:

SELECT market_capitalization
FROM company
WHERE CROWDEQUAL(name, "I.B.M.");

Page 35: Crowd DB : Answering Queries with Crowdsourcing

CROWDORDER

CROWDORDER is used to rank or order results. For example, given the table

CREATE TABLE picture (
  p IMAGE,
  subject STRING
);

the following query asks the crowd to rank pictures by how well they visualize a subject:

SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which picture visualizes better %subject");
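To make the ORDER BY CROWDORDER semantics concrete, here is a hypothetical sketch (not CrowdDB's implementation) that sorts items by crowd-supplied pairwise comparisons, taking a small vote per pair; ask_crowd is an assumed callback that returns one worker's judgment:

# Hypothetical sketch: sorting with crowd-supplied pairwise comparisons.
from functools import cmp_to_key

def crowd_order(items, question, ask_crowd, votes_per_pair=3):
    def compare(a, b):
        # ask_crowd returns -1, 0, or +1 for one worker's judgment of (a, b).
        total = sum(ask_crowd(question, a, b) for _ in range(votes_per_pair))
        return -1 if total < 0 else (1 if total > 0 else 0)
    return sorted(items, key=cmp_to_key(compare))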

Page 36: Crowd DB : Answering Queries with Crowdsourcing

CrowdSQL in Practice

◦ CrowdSQL is a minimal extension to SQL.

◦ CrowdSQL changes the closed-world assumption to an open-world assumption.

◦ As a result, the cost and response time of queries can be unbounded, so there must be a way to define a budget for a query.

◦ Currently there is only one way to specify a budget: using a LIMIT clause.

Page 37: Crowd DB : Answering Queries with Crowdsourcing

Lineage

◦ Applications can query the lineage of crowdsourced data in order to take actions, such as cleansing or entity resolution of crowdsourced data.

◦ To support this, all CROWD tables must have a primary key.

Page 38: Crowd DB : Answering Queries with Crowdsourcing

USER INTERFACE GENERATION

◦ Crowdsourcing performance depends on the provision of effective user interfaces.

◦ CrowdDB automatically generates user interfaces in a two-step process.

◦ The generated user interfaces are HTML (and JavaScript) forms.

Page 39: Crowd DB : Answering Queries with Crowdsourcing

Basic Interfaces

◦ The title of the HTML form is the name of the table.

◦ The form asks the worker to input the missing information, copying the known field values into the HTML form.

◦ CrowdDB generates JavaScript code to check for the correct types of input.

A minimal sketch of this schema-driven generation follows below.
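The sketch below is a rough illustration of the idea, not CrowdDB's actual generator: known column values are copied in as read-only fields, while crowdsourced columns become inputs. All names and structures here are hypothetical:

# Hypothetical sketch of schema-driven interface generation.
# Known values become read-only fields; CROWD columns become inputs.
from html import escape

def generate_basic_interface(table, columns, crowd_columns, known_values):
    rows = []
    for col in columns:
        if col in crowd_columns and col not in known_values:
            field = f'<input type="text" name="{escape(col)}">'
        else:
            value = escape(str(known_values.get(col, "")))
            field = f'<input type="text" value="{value}" readonly>'
        rows.append(f"<tr><td>{escape(col)}</td><td>{field}</td></tr>")
    return (f"<h1>{escape(table)}</h1>"            # form title = table name
            f'<form method="post"><table>{"".join(rows)}</table>'
            '<input type="submit" value="Submit"></form>')

print(generate_basic_interface(
    table="Department",
    columns=["university", "name", "url", "phone"],
    crowd_columns=["url"],
    known_values={"university": "UC Berkeley", "name": "EECS"},
))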

Page 40: Crowd DB : Answering Queries with Crowdsourcing

Multi-Relation Interfaces

◦ When a foreign key references a non-crowdsourced table, the generated user interface shows a drop-down box, and an Ajax-based “suggest” function is provided.

◦ CrowdDB supports two types of user interfaces: a normalized interface and a denormalized interface.

Page 41: Crowd DB : Answering Queries with Crowdsourcing

QUERY PROCESSING

◦ Query plan generation and execution follow a largely traditional approach.

◦ The main differences are that the CrowdDB parser understands the CrowdSQL extensions and that generated plans may contain crowd operators.

Page 42: Crowd DB : Answering Queries with Crowdsourcing

Crowd Operators

◦ CrowdDB implements all operators of the relational algebra, just like any traditional database system.

◦ Crowd operators are initialized with a user interface template and the standard HIT parameters.

◦ Quality control is carried out by a majority vote; a minimal sketch of such a vote follows below.
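A minimal sketch of majority-vote quality control over the answers submitted for one HIT's assignments (an assumption of how such a vote might look, not CrowdDB's actual code):

# Majority vote over the answers submitted for one HIT's assignments.
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

answer, confidence = majority_vote(["IBM", "IBM", "I.B.M.", "IBM", "IBM"])
print(answer, confidence)  # IBM 0.8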

Page 43: Crowd DB : Answering Queries with Crowdsourcing

CrowdDB has three crowd operators:

CrowdProbe:

◦ crowdsources missing information of CROWD columns.

CrowdJoin:

◦ implements an index nested-loop join over two tables.

CrowdCompare:

◦ implements the CROWDEQUAL and CROWDORDER functions.

A sketch of how such an operator can plug into the iterator model follows below.
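The following sketch shows how a CrowdProbe-like operator could fit the iterator-based query processing paradigm mentioned earlier: for each input tuple with a missing CROWD value, it posts a HIT, collects the assignments, and fills in the value by majority vote. All names here are hypothetical, and a real implementation would batch HITs and poll asynchronously rather than block per tuple:

# Hypothetical CrowdProbe-style iterator: fills missing CROWD columns
# by posting HITs and taking a majority vote over worker answers.
from collections import Counter

class CrowdProbe:
    def __init__(self, child, crowd_column, platform, num_assignments=3):
        self.child = child                  # upstream iterator of dict tuples
        self.crowd_column = crowd_column
        self.platform = platform            # wrapper around createHIT etc.
        self.num_assignments = num_assignments

    def __iter__(self):
        for tuple_ in self.child:
            if tuple_.get(self.crowd_column) is None:
                hit_id = self.platform.create_hit(tuple_, self.crowd_column,
                                                  self.num_assignments)
                answers = self.platform.wait_for_answers(hit_id)
                # Quality control: majority vote across assignments.
                tuple_[self.crowd_column] = Counter(answers).most_common(1)[0][0]
            yield tuple_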

Page 44: Crowd DB : Answering Queries with Crowdsourcing

Physical Plan Generation

◦ A query is first parsed.

◦ The logical plan is optimized using traditional and crowd-specific optimizations (currently, only predicate push-down is applied).

◦ The logical plan is then translated into a physical plan.

Page 45: Crowd DB : Answering Queries with Crowdsourcing

Heuristics

◦ CrowdDB uses a simple rule-based optimizer.

◦ It implements several essential query rewriting rules, such as predicate push-down.

◦ It annotates the query plan with cardinality predictions between the operators.

◦ It re-orders the operators to minimize the number of crowd requests (see the sketch below).

◦ In contrast to a cost-based optimizer, a rule-based optimizer is not able to exhaustively explore all parameters.
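As a toy illustration of the re-ordering heuristic (hypothetical, not the CrowdDB optimizer): given predicted selectivities, evaluate machine-only predicates first so that as few tuples as possible reach the expensive crowd operators:

# Toy rule-based re-ordering: cheap machine predicates before crowd operators,
# and within each class, the most selective predicate first.
def order_predicates(predicates):
    """predicates: list of (name, selectivity, needs_crowd) triples."""
    return sorted(predicates, key=lambda p: (p[2], p[1]))

plan = order_predicates([
    ("CROWDEQUAL(name, 'I.B.M.')", 0.01, True),   # crowd: expensive per tuple
    ("state = 'CA'", 0.10, False),                # machine: run first
    ("revenue > 1e9", 0.30, False),
])
for name, selectivity, needs_crowd in plan:
    print(name)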

Page 46: Crowd DB : Answering Queries with Crowdsourcing

EXPERIMENTS AND RESULTS

We ran over 25,000 HITs on AMT and measured the response time and quality of the answers provided by the workers.

It is important to note that the results presented in this section are highly dependent on the properties of the AMT platform at a particular point in time. They are also highly dependent on the specific tasks we sent to the crowd.

Page 47: Crowd DB : Answering Queries with Crowdsourcing

Micro Benchmarks

We begin by describing the results of experiments with simple tasks requiring workers to find and fill in missing data for a table with two crowdsourced columns:

CREATE TABLE businesses (
  name VARCHAR PRIMARY KEY,
  phone_number CROWD VARCHAR(32),
  address CROWD VARCHAR(256)
);

Page 48: Crowd DB : Answering Queries with Crowdsourcing

Experiment 1: Response Time, Varying HIT Group Size

In the first set of experiments, we examined the response time of assignments as a function of the number of HITs in a HIT Group. The results show that response times decrease dramatically as the size of the HIT Group is increased (Figure 4).

Page 49: Crowd DB : Answering Queries with Crowdsourcing

Experiment 1: Response Time, Varying HIT Group Size (2)

Figure 5 shows that there is a tradeoff between throughput (in terms of HITs completed per unit time) and completion percentage. While the best throughput was obtained with the largest Group size (223.8 out of a Group of 400 HITs), the highest completion rates were obtained with Groups of 50 or 100 HITs.

Page 50: Crowd DB : Answering Queries with Crowdsourcing

Experiment 2: Responsiveness, Varying Reward

Figure 6 shows the fraction of HITs (i.e., all five assignments) that were completed as a function of time, for HITs with a reward of (from the bottom up) 1, 2, 3, and 4 cents per assignment. For all rewards, the longer we wait, the more HITs are completed. The results show that for this particular task, paying 4 cents gives the best performance, while paying 1 cent gives the worst.

Page 51: Crowd DB : Answering Queries with Crowdsourcing

Experiment 2: Responsiveness, Varying Reward (2)

Comparing Figures 6 and 7, two observations can be made. First, within sixty minutes almost all HITs received at least one answer if a reward of 2, 3, or 4 cents was given (Figure 7), whereas only about 65% of the HITs were fully completed even for the highest reward of 4 cents (Figure 6). Second, if a single response to a HIT is acceptable, there is little incentive to pay more than 2 cents in this case.

Page 52: Crowd DB : Answering Queries with Crowdsourcing

Experiment 3: Worker Affinity and Quality

The previous experiments focused on the response time and throughput of the crowd. Here, we turn our attention to two other issues: the distribution of work among workers and answer quality.

In this experiment, every HIT had five assignments. For each worker, Figure 8 shows the number of HITs computed by that worker and the number of errors made by that worker; workers are plotted along the x-axis in decreasing order of the number of HITs they processed.

Page 53: Crowd DB : Answering Queries with Crowdsourcing

Complex Queries

For this experiment, we used a simple company schema with two attributes, company name and headquarters address, and populated it with the Fortune 100 companies.

Figure 9 shows the results for four different instances of this query. Each HIT involved comparing ten company names with one of the four “non-uniform names”. Each HIT had three assignments, and the reward was 1 cent per HIT.

Page 54: Crowd DB : Answering Queries with Crowdsourcing

Ordering Pictures

Figure 10 shows the ranking of the top 8 pictures for the “Golden Gate” subject area. More concretely, it shows each picture, the number of workers that voted for it, and the picture's rank according to the majority vote.

Page 55: Crowd DB : Answering Queries with Crowdsourcing

CONCLUSION

◦ We highlighted two cases where human input is needed: (1) unknown or incomplete data, and (2) subjective comparisons.

◦ We described simple extensions to the SQL DDL and DML for crowdsourcing.

◦ Performance, in terms of latency and cost, is also affected by the way in which CrowdDB interacts with workers.

Page 56: Crowd DB : Answering Queries with Crowdsourcing

CONCLUSION (2)

Of course, there is future work to do. Answer quality is a pervasive issue that must be addressed. Clear semantics in the presence of the open-world assumption are needed for both crowdsourced values and tables.

Cost-based optimization for the crowd involves many new parameters and considerations. Caching and managing crowdsourced answers, if done properly, can also greatly improve performance.

Page 57: Crowd DB : Answering Queries with Crowdsourcing

Any questions?

Thank you for your attention.