Efficient Scheduling of Heterogeneous Continuous Queries Mohamed A. Sharaf Panos K. Chrysanthis Alexandros Labrinidis Kirk Pruhs Advanced Data Management Technologies Lab Department of Computer Science University of Pittsburgh VLDB 2006


Efficient Scheduling of Heterogeneous Continuous Queries

Mohamed A. Sharaf, Panos K. Chrysanthis, Alexandros Labrinidis, Kirk Pruhs

Advanced Data Management Technologies Lab, Department of Computer Science

University of Pittsburgh

VLDB 2006


Motivating Example

Tell me when there are airplane tickets such that:

Itinerary: Pittsburgh -> Korea -> Pittsburgh

Dates: September 8 -> September 16

Price < $1200

This is a form of Continuous Query (CQ): CQs are registered ahead of time, and the arrival of new data triggers their execution

CQs support monitoring applications: <insert your favorite monitoring application here>


Data Stream Management System (DSMS)

DSMS = Database system + Online system

Our Goal: Improve the online performance of a DSMS

[Figure: DSMS architecture. Input data streams feed continuous queries Q1 ... Qn, each a chain of operators 1, 2, 3, which produce output data streams D1 ... Dn. Components shown: Query Optimizer, Memory Manager, Load Shedder, Query Scheduler.]


Need for Query Scheduling

The execution order of continuous queries determines the overall behavior of the system, e.g., memory usage [Babcock et al., SIGMOD’03]

Traditionally:

One operator per thread

Resource management done by OS

Problems:

No objective for optimization

Does not exploit query semantics


Scheduling Multiple Continuous Queries (MCQ)

Given:

A set of n queries ready to execute (queries with pending updates)

A certain metric to optimize

Then:

The MCQ Scheduler decides the execution order of the n queries so as to optimize the given metric

[Figure: continuous queries CQ1, CQ2, ..., CQn, each drawn as a chain of operators 1, 2, 3.]
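As a minimal sketch of the scheduling problem stated above, the loop below repeatedly picks, among the queries that still have pending tuples, the one with the highest priority. The function name `schedule_order`, the one-tuple-per-scheduling-point granularity, and the static priority values are assumptions made for illustration; they are not details given in the talk.

```python
from typing import Callable, Dict, List


def schedule_order(pending: Dict[str, int],
                   priority: Callable[[str], float]) -> List[str]:
    """Return the order in which ready queries are picked.

    pending maps a query name to its number of pending tuples.
    priority maps a query name to the value the scheduler maximizes
    (a stand-in for whatever metric-specific policy is plugged in).
    """
    remaining = dict(pending)
    order: List[str] = []
    while any(remaining.values()):
        # Only queries with pending tuples are ready to execute.
        ready = [name for name, count in remaining.items() if count > 0]
        chosen = max(ready, key=priority)
        order.append(chosen)
        remaining[chosen] -= 1  # process one pending tuple of the chosen query
    return order


# Hypothetical static priorities, for illustration only.
static_priority = {"CQ1": 0.5, "CQ2": 2.0, "CQ3": 1.0}
order = schedule_order({"CQ1": 1, "CQ2": 2, "CQ3": 1},
                       static_priority.__getitem__)
print(order)
```

Keeping the priority function as a parameter means that each policy discussed in the rest of the talk can be expressed as a different `priority` callable without changing the loop.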


Outline

Introduction

Scheduling for Quality of Service (QoS)

Average response time

Average slowdown

Balancing the trade-off between average and worst case

Implementation issues

Conclusions


Response Time

The response time of a tuple is the interval of time between its arrival at the DSMS and its departure

Tuples that are filtered out (discarded) during query processing do not contribute to the metric

Shortest Remaining Processing Time (SRPT) is the policy to optimize response time in Web servers

Would SRPT optimize response time for multiple CQs? No, because it does not exploit the characteristics of CQs


Impact of Selectivity

Selectivity of a query (S) is the probability of producing an output tuple after processing an input tuple (i.e., detecting a related event)

S = 0.1: 10 input tuples -> 1 output event

S = 1.0: 10 input tuples -> 10 output events

If two queries have the same cost then:

the one with higher selectivity produces more tuples per time unit (higher Output Rate).


Impact of Output Rate

Q1: S1 = 1.0 and C1 = 1 ms, so OR1 = 1.0

Q2: S2 = 0.2 and C2 = 1 ms, so OR2 = 0.2

5 pending tuples arrived at time 0

Order        RT
Q2 then Q1   12.2
Q1 then Q2   7.1

[Figure: execution timelines from 0 to 10 for the two orders, "Q1 then Q2" and "Q2 then Q1".]


Highest Rate Policy

Assign each query a priority equal to its output rate

The output rate of a query = selectivity/cost

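How to compute the output rate of a query with more than one operator?

Global output rate (GR) of query Qi, for a chain of operators 1, 2, 3 with costs c and selectivities s:

One operator: $GR_i = \frac{S_i}{C_i^{avg}} = \frac{s_1}{c_1}$

Two operators: $GR_i = \frac{S_i}{C_i^{avg}} = \frac{s_1 \times s_2}{c_1 + c_2 \times s_1}$

Three operators: $GR_i = \frac{S_i}{C_i^{avg}} = \frac{s_1 \times s_2 \times s_3}{c_1 + c_2 \times s_1 + c_3 \times s_1 \times s_2}$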

At each scheduling point, schedule the query with the highest global output rate: the Highest Rate (HR) policy
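The output-rate computation for a multi-operator query can be sketched as follows. This is an illustrative sketch, assuming a query is a linear chain of operators given as (cost, selectivity) pairs; the names `global_output_rate` and `pick_highest_rate` and the example numbers are hypothetical and do not come from the talk.

```python
from typing import Dict, List, Tuple

Operator = Tuple[float, float]  # (cost per tuple, selectivity)


def global_output_rate(ops: List[Operator]) -> float:
    """Global output rate S / C_avg of a linear chain of operators.

    S is the product of the operator selectivities. C_avg charges each
    operator's cost only for the fraction of input tuples that reach it,
    e.g. c1 + c2*s1 + c3*s1*s2 for three operators.
    """
    reach_probability = 1.0  # fraction of input tuples reaching the current operator
    avg_cost = 0.0
    for cost, selectivity in ops:
        avg_cost += cost * reach_probability
        reach_probability *= selectivity
    return reach_probability / avg_cost


def pick_highest_rate(queries: Dict[str, List[Operator]]) -> str:
    """HR policy: choose the ready query with the highest global output rate."""
    return max(queries, key=lambda name: global_output_rate(queries[name]))


# Hypothetical queries, for illustration only.
queries = {
    "Qa": [(1.0, 0.5), (2.0, 0.5)],  # S = 0.25, C_avg = 1 + 2*0.5 = 2
    "Qb": [(1.0, 0.2)],              # S = 0.2,  C_avg = 1
}
for name, ops in queries.items():
    print(name, global_output_rate(ops))
print("HR picks", pick_highest_rate(queries))
```

In this example the two-operator query has the higher selectivity but also the higher average cost, so the single-operator query ends up with the higher rate and is scheduled first.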


Simulation Testbed

Developed a DSMS simulator in C++

Policies for multi-query scheduling: Round Robin (RR; Aurora), Highest Rate (HR), First Come First Serve (FCFS), and Shortest Remaining Processing Time (SRPT)

Input traces from Internet traffic

Generate 500 continuous queries: select-join-project, with a uniform distribution of costs and selectivities

Assigned costs and selectivities determine the system’s utilization (or load)


Results: Average Response Time

[Figure: average response time (μsec) versus utilization (0.1 to 1.0) for RR, FCFS, SRPT, and HR; the plot is annotated with 65% and 73%.]




Slowdown

Slowdown (or stretch): [Mehta & DeWitt, VLDB’93]

Ratio of a tuple’s response time to its ideal processing time, i.e., the time it would take if it were the only tuple in the system

Slowdown is fairer than response time:

It relates response time to demand: tuples for an expensive query are expected to stay longer as they contribute more to the load

Ideally, slowdown = 1

Slowdown increases with increasing load


SRPT vs. HR

In Web servers, SRPT is optimal for response time and near optimal for slowdown: short jobs spend a shorter time in the system

In DSMSs, HR minimizes average response time, but what about average slowdown?

Is it possible under HR for short queries to experience high slowdown, leading to an overall high slowdown?

Queries with low selectivity are penalized!


Example

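Q1: S1 = 1.0 and C1 = 5 ms, so OR1 = 0.2

Q2: S2 = 0.33 and C2 = 2 ms, so OR2 = 0.17

3 pending tuples arrived at time 0

HR policy: Q1, Q1, Q1 (finishing at 5, 10, 15), then Q2, Q2, Q2 (finishing at 17, 19, 21); SD = 1, 2, 3 for the Q1 outputs and SD = 9.5 for the Q2 output

Another policy: Q2, Q2, Q2 (finishing at 2, 4, 6), then Q1, Q1, Q1 (finishing at 11, 16, 21); SD = 2 for the Q2 output and SD = 2.2, 3.2, 4.2 for the Q1 outputs

Policy   RT     SD
HR       12.2   3.8
Other    13     2.9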
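The averages in this example can be re-derived with a short simulation. The sketch below assumes single-operator queries (so the ideal processing time equals the per-tuple cost) and assumes that Q2's one output comes from its second pending tuple, which is what the slide's SD values of 9.5 (19/2) and 2 (4/2) imply. The names `QUERIES` and `evaluate` are hypothetical.

```python
from typing import List, Tuple

# name -> (cost per tuple in ms, indices of the pending tuples that produce output)
QUERIES = {
    "Q1": (5.0, [0, 1, 2]),  # S1 = 1.0: every tuple produces an output
    "Q2": (2.0, [1]),        # S2 = 0.33: assumed that only the middle tuple does
}
PENDING_PER_QUERY = 3  # all tuples arrive at time 0


def evaluate(order: List[str]) -> Tuple[float, float]:
    """Run the queries back to back; return (average RT, average slowdown).

    Filtered-out tuples consume time but do not contribute to the metrics.
    """
    clock = 0.0
    response_times: List[float] = []
    slowdowns: List[float] = []
    for name in order:
        cost, producing = QUERIES[name]
        for index in range(PENDING_PER_QUERY):
            clock += cost
            if index in producing:
                response_times.append(clock)    # arrival time is 0
                slowdowns.append(clock / cost)  # ideal time = cost
    count = len(response_times)
    return sum(response_times) / count, sum(slowdowns) / count


hr_rt, hr_sd = evaluate(["Q1", "Q2"])        # HR runs Q1 first (OR1 > OR2)
other_rt, other_sd = evaluate(["Q2", "Q1"])  # the alternative order
print("HR   :", hr_rt, hr_sd)
print("Other:", other_rt, other_sd)
```

Under these assumptions HR gives an average response time of 12.25 ms and an average slowdown of 3.875, which the slide appears to report truncated as 12.2 and 3.8; the other order gives 13 ms and about 2.9.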


Parameters for Scheduling

Sx = s1 * s2 * s3

Cxavg = c1 + (c2*s1) + (c3*s1*s2)

Cx = cost of detecting an event = c1 + c2 + c3 = ideal processing time

Wx = the current wait time of the oldest tuple in the input queue of Qx

[Figure: query Qx as a chain of three operators 1, 2, 3.]


Scheduling for Slowdown (1)

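Compute slowdown (H) under two policies:

Policy X: first Q1 then Q2

Policy Y: first Q2 then Q1

$H_X = S_1 \times \frac{W_1 + C_1}{C_1} + S_2 \times \frac{W_2 + C_1^{avg} + C_2}{C_2}$

Here $\frac{W_1 + C_1}{C_1}$ is t1's slowdown and $\frac{W_2 + C_1^{avg} + C_2}{C_2}$ is t2's slowdown. S1 is the probability that t1 is produced, W1 is the wait time, C1 is the processing time, and C1avg (in t2's term) is the extra wait time for Q1 to finish execution.

[Figure: queries Q1 (W1, C1avg, S1) and Q2 (W2, C2avg, S2) producing tuples t1 and t2, with ideal costs C1 and C2.]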


Scheduling for Slowdown (2)

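Under policy X: first Q1 then Q2

$H_X = S_1 \times \frac{W_1 + C_1}{C_1} + S_2 \times \frac{W_2 + C_1^{avg} + C_2}{C_2}$

Under policy Y: first Q2 then Q1

$H_Y = S_2 \times \frac{W_2 + C_2}{C_2} + S_1 \times \frac{W_1 + C_2^{avg} + C_1}{C_1}$

For HX < HY:

$\frac{1}{C_1} \times \frac{S_1}{C_1^{avg}} > \frac{1}{C_2} \times \frac{S_2}{C_2^{avg}}$

[Figure: queries Q1 (W1, C1avg, S1) and Q2 (W2, C2avg, S2) producing tuples t1 and t2, with ideal costs C1 and C2.]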
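The step from the two slowdown totals to the condition for HX < HY can be filled in as follows. This is a reconstruction using the slide's own symbols, assuming all costs are positive; the wait times W1 and W2 and the own-processing terms appear in both totals and cancel.

```latex
\begin{align*}
H_X &= S_1\,\frac{W_1 + C_1}{C_1} + S_2\,\frac{W_2 + C_1^{avg} + C_2}{C_2} \\
H_Y &= S_2\,\frac{W_2 + C_2}{C_2} + S_1\,\frac{W_1 + C_2^{avg} + C_1}{C_1} \\
H_X - H_Y &= S_2\,\frac{C_1^{avg}}{C_2} - S_1\,\frac{C_2^{avg}}{C_1} \\
H_X < H_Y &\iff \frac{S_2\,C_1^{avg}}{C_2} < \frac{S_1\,C_2^{avg}}{C_1}
          \iff \frac{S_1}{C_1\,C_1^{avg}} > \frac{S_2}{C_2\,C_2^{avg}}
\end{align*}
```

The last equivalence divides both sides by the positive quantity $C_1^{avg}\,C_2^{avg}$. Because the wait times drop out, the resulting priority depends only on each query's selectivity and costs.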


Scheduling for Slowdown (3)

Sx/Cxavg is the output rate (ORx) of Qx

Cx is the ideal processing time of a tuple produced by Qx

Our Highest Normalized Rate (HNR) policy emphasizes the tuple ideal processing time

Inexpensive queries with low productivity are not penalized

Priority of Qx = $\frac{1}{C_x} \times \frac{S_x}{C_x^{avg}} = \frac{1}{C_x} \times OR_x$

For equal costs: Ci = 1 -> HNR = HR

For selectivity 1: Si = 1 -> HNR = SRPT
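A small sketch can contrast the HR and HNR priorities on the two queries from the earlier slowdown example (Q1: S1 = 1.0, C1 = 5 ms; Q2: S2 = 0.33, C2 = 2 ms). The helper names `query_parameters`, `hr_priority`, and `hnr_priority`, and the representation of a query as a list of (cost, selectivity) pairs, are assumptions made for illustration.

```python
from typing import List, Tuple

Operator = Tuple[float, float]  # (cost per tuple, selectivity)


def query_parameters(ops: List[Operator]) -> Tuple[float, float, float]:
    """Return (S, C_avg, C) for a linear chain of operators.

    S     = product of selectivities
    C_avg = c1 + c2*s1 + c3*s1*s2 + ...  (average cost per input tuple)
    C     = c1 + c2 + c3 + ...           (ideal processing time of an output)
    """
    selectivity, avg_cost, ideal_cost = 1.0, 0.0, 0.0
    for cost, op_selectivity in ops:
        avg_cost += cost * selectivity
        ideal_cost += cost
        selectivity *= op_selectivity
    return selectivity, avg_cost, ideal_cost


def hr_priority(ops: List[Operator]) -> float:
    """Highest Rate: output rate S / C_avg."""
    s, c_avg, _ = query_parameters(ops)
    return s / c_avg


def hnr_priority(ops: List[Operator]) -> float:
    """Highest Normalized Rate: output rate divided by the ideal cost C."""
    s, c_avg, c = query_parameters(ops)
    return s / (c_avg * c)


q1 = [(5.0, 1.0)]   # Q1 from the slowdown example
q2 = [(2.0, 0.33)]  # Q2 from the slowdown example

print("HR :", hr_priority(q1), hr_priority(q2))
print("HNR:", hnr_priority(q1), hnr_priority(q2))
```

HR ranks Q1 first (0.2 versus 0.165), while HNR ranks Q2 first (0.0825 versus 0.04), which corresponds to the order the example labels "Another policy", the one with the lower average slowdown.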


Results: Average Slowdown

[Figure: average slowdown versus utilization (0.1 to 1.0) for RR, FCFS, SRPT, HR, and HNR; the plot is annotated with 20%.]




Worst-Case Performance

Queries/events may experience starvation: queries with low selectivity and/or high cost

Typically measured using maximum response time or maximum slowdown

Maximum slowdown (or response time) is a very sensitive metric, and it does not consider the average-case performance


Trade-off between Average Case and Worst Case

Maximum slowdown = worst-case performance

Average slowdown = average-case performance

We need to look at both metrics at the same time

Lp norm of slowdowns captures both metrics

L2 norm of N tuples = $\sqrt{\sum_{i=1}^{N} Slowdown_i^2}$

It takes into account all values

It penalizes outliers
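To illustrate how the L2 norm reflects both the average and the worst case, the sketch below compares the two sets of slowdown values from the earlier two-query example (SD = 1, 2, 3, 9.5 under HR and SD = 2, 2.2, 3.2, 4.2 under the other order). The function name `summarize` is hypothetical.

```python
import math
from typing import List, Tuple


def summarize(slowdowns: List[float]) -> Tuple[float, float, float]:
    """Return (average, maximum, L2 norm) of a list of per-tuple slowdowns."""
    average = sum(slowdowns) / len(slowdowns)
    worst = max(slowdowns)
    l2_norm = math.sqrt(sum(value * value for value in slowdowns))
    return average, worst, l2_norm


hr_slowdowns = [1.0, 2.0, 3.0, 9.5]     # HR order in the earlier example
other_slowdowns = [2.0, 2.2, 3.2, 4.2]  # the alternative order

hr_summary = summarize(hr_slowdowns)
other_summary = summarize(other_slowdowns)
print("HR    avg, max, L2:", hr_summary)
print("Other avg, max, L2:", other_summary)
```

The single outlier of 9.5 dominates the sum of squares, so the HR order has a noticeably larger L2 norm (about 10.2 versus about 6.1) even though three of its four slowdowns are lower.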


Scheduling for the L2 Norm of Slowdowns

Balance Slowdown Policy (BSD)

Priority of Qx = $\left(\frac{S_x}{C_x^{avg} \times C_x}\right) \times \left(\frac{W_x}{C_x}\right)$, where the first factor is the normalized rate and the second is the current slowdown

A query is scheduled either because it has a high normalized rate or because its pending tuples have accumulated high slowdown

All users are satisfied = Fairness
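The effect of the wait-time factor can be sketched with two hypothetical queries: one cheap and productive, one expensive with low selectivity. The function name `bsd_priority` and all parameter values below are assumptions chosen for illustration; only the formula (normalized rate times current slowdown) comes from the slide.

```python
def bsd_priority(selectivity: float, avg_cost: float,
                 ideal_cost: float, wait: float) -> float:
    """BSD priority: (S / (C_avg * C)) * (W / C).

    The first factor is the normalized rate used by HNR; the second is
    the current slowdown of the oldest pending tuple.
    """
    normalized_rate = selectivity / (avg_cost * ideal_cost)
    current_slowdown = wait / ideal_cost
    return normalized_rate * current_slowdown


# Hypothetical parameters, for illustration only.
FAST = dict(selectivity=1.0, avg_cost=1.0, ideal_cost=1.0)  # normalized rate 1.0
SLOW = dict(selectivity=0.1, avg_cost=4.0, ideal_cost=4.0)  # normalized rate 0.00625

# With equal wait times, the high-rate query wins, as under HNR.
equal_wait = (bsd_priority(wait=10.0, **FAST), bsd_priority(wait=10.0, **SLOW))

# Once the low-rate query's oldest tuple has waited long enough,
# its accumulated slowdown lets it overtake a freshly arrived tuple.
starved = (bsd_priority(wait=1.0, **FAST), bsd_priority(wait=1000.0, **SLOW))

print("equal wait:", equal_wait)
print("starved   :", starved)
```

In this sketch the low-rate query overtakes only after a long wait, which is how the policy bounds the worst case without giving up the normalized-rate ordering in the common case.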


Results: Balancing the trade-off

[Figure: two bar charts comparing the HNR and BSD scheduling policies, one for average slowdown and one for maximum slowdown; the charts are annotated with 77% and 31%.]


Results: L2 Norm of Slowdowns

[Figure: L2 norm of slowdowns versus utilization (0.1 to 1.0) for RR, FCFS, SRPT, HNR, and BSD; the plot is annotated with 24%.]


Slowdown per Class (same cost queries)

[Figure: average stretch per class (log scale, 1 to 100000) versus selectivity (0.1 to 1) for RR, SRPT, HR, HNR, and BSD.]


Outline

Introduction

Scheduling for Quality of Service (QoS)

Implementation issues

Scheduling overhead

Shared operators (details in paper)

Conclusions


Implementation Technique

[Figure: bar chart of the ratio (L2 SD of BSD-Logarithmic / L2 SD of BSD-Hypothetical) by optimization method; methods shown: None, +Clustering, +Pruning, +Clustered Processing; values shown: 6470%, 1550%, 224%, 105%.]

BSD-Hypothetical = BSD without overhead


Conclusions

In this talk, we presented:

QoS metrics for evaluating the performance of a DSMS

Scheduling policies that exploit the properties of CQs

Policies to improve QoS:

Highest Rate (HR) for average response time

Highest Normalized Rate (HNR) for average slowdown

Balance Slowdown (BSD) for balancing the trade-off between average- and worst-case performance

Addressed implementation issues to ensure the applicability of our proposed policies

We empirically evaluated the gains provided by the proposed policies compared to existing policies


Thank You

Questions?

http://db.cs.pitt.edu/streams

Thanks: NSF IIS-0534531

(AQSIOS Project)