Load Shedding: CS240B notes

Upload: jerome-stephens

Post on 18-Dec-2015




Load Shedding in a DSMS

DSMS: online response on boundless and bursty data streams. How?

By using approximations and synopses, and even by shedding load when arrival rates become impossible to sustain

Approximations and Synopses are often used with normal load

Shedding is used for bursty streams and overload situations.


QoS and Load Shedding

When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)

Load shedding affects queries and their answers: drop the tasks and the tuples that will cause least loss

Introducing load shedding in a data stream manager is a challenging problem

Random load shedding or semantic load shedding


Problems to Address

When to shed load
– Overload should be detected quickly

Where to shed load
– Avoid wasted work
– Upstream drop vs. downstream drop

How much to shed
– The magnitude of the drop

Which tuples to shed


Loss-tolerance QoS function

Loss function is not linear:


Value-based QoS

Value-based QoS shows which values of the output tuple space are most important.

In a medical application that monitors patient heartbeats, extreme values are certainly more interesting than normal ones

They are therefore assigned a correspondingly higher utility
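
A minimal sketch of how a value-based utility could drive semantic load shedding for the heartbeat example. The thresholds, the utility values, and the names `heartbeat_utility` and `semantic_shed` are hypothetical and chosen only for illustration; they do not come from the slides or from any particular system.

```python
def heartbeat_utility(bpm, low=50.0, high=120.0):
    """Hypothetical utility: extreme readings are worth more than normal ones."""
    return 1.0 if bpm < low or bpm > high else 0.2


def semantic_shed(tuples, drop_fraction, utility=heartbeat_utility):
    """Drop the lowest-utility fraction of a batch, keeping arrival order."""
    n_drop = int(len(tuples) * drop_fraction)
    # Indices sorted by ascending utility; the first n_drop are shed.
    by_utility = sorted(range(len(tuples)), key=lambda i: utility(tuples[i]))
    dropped = set(by_utility[:n_drop])
    return [t for i, t in enumerate(tuples) if i not in dropped]


batch = [72, 180, 65, 40, 80, 75]
kept = semantic_shed(batch, drop_fraction=0.5)
print(kept)  # the extreme readings 180 and 40 survive
```

Random load shedding would instead drop half of the batch without looking at the values, so the extreme readings would be as likely to be lost as the normal ones.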


Load Shedding in Aurora

QoS for each application as a function relating output to its utility

– Delay based, drop based, value based

Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least

– Determining when, where and how much load to shed


Which Query to drop First?

Models and algorithms proposed include

– Greedy algorithms or Fractional Knapsack Problem

– Other OR techniques

– Must deal with nonlinearities
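
The greedy, fractional-knapsack view can be sketched as follows. This is an illustration under a simplifying assumption that the slides explicitly warn about: the utility loss of each query is taken to be linear in the fraction of its load that is shed. The function name `greedy_shed` and the tuple layout are hypothetical.

```python
def greedy_shed(queries, excess_load):
    """Pick what to shed, cheapest utility loss per unit of load first.

    queries: list of (name, load, utility_loss_if_fully_dropped).
    Returns {name: fraction_of_that_query's_load_to_shed}.
    Assumes loss grows linearly with the shed fraction (a simplification).
    """
    plan = {}
    remaining = excess_load
    for name, load, loss in sorted(queries, key=lambda q: q[2] / q[1]):
        if remaining <= 0:
            break
        shed = min(load, remaining)   # a query may be shed only partially
        plan[name] = shed / load
        remaining -= shed
    return plan


queries = [("q1", 40.0, 8.0), ("q2", 30.0, 3.0), ("q3", 30.0, 9.0)]
plan = greedy_shed(queries, excess_load=50.0)
print(plan)  # q2 is shed entirely, q1 by half, q3 is untouched
```

With nonlinear loss-tolerance curves the loss-per-load ratio changes as more is shed, so a single sort of this kind is no longer sufficient.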


Load Shedding in STREAM

Formulate load shedding as an optimization problem for multiple sliding window aggregate queries

– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate

Consider placement of load shedding operators in query plan

– Each operator i sheds load uniformly with probability p_i
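
A minimal sketch of a uniform load shedding operator. It assumes that the probability on the slide is the probability of dropping a tuple; if it is read as the probability of keeping a tuple, the scaling factor becomes 1/p instead of 1/(1 - p). The class name `LoadShedder` and its methods are hypothetical.

```python
import random


class LoadShedder:
    """Drops each tuple independently with probability drop_prob (assumed meaning of p_i)."""

    def __init__(self, drop_prob, seed=None):
        if not 0.0 <= drop_prob < 1.0:
            raise ValueError("drop_prob must be in [0, 1)")
        self.drop_prob = drop_prob
        self._rng = random.Random(seed)

    def filter(self, tuples):
        """Return the tuples that survive shedding, in arrival order."""
        return [t for t in tuples if self._rng.random() >= self.drop_prob]

    def scaled_sum(self, tuples):
        """Estimate the SUM over all tuples by scaling the surviving sum by 1/(1 - p)."""
        return sum(self.filter(tuples)) / (1.0 - self.drop_prob)


shedder = LoadShedder(drop_prob=0.5, seed=1)
estimate = shedder.scaled_sum([1.0] * 1000)  # an estimate of the true sum, 1000
```

Scaling the surviving sum keeps the estimate unbiased, which is why the placement and the rates of such operators can be chosen to trade accuracy against load.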


Window-Oriented Load Shedding

Input stream divided into windows of size w

Use fewer slides per window to compute aggregates; tumbling windows are the extreme case.

Window-based sampling

– Reservoir sampling for incoming tuples

– Expiring tuples pose a more difficult problem.
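
For reference, a sketch of classic reservoir sampling (often called Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length. It covers only incoming tuples; it does not handle tuples expiring from a sliding window, which is the harder problem noted above. The function name and the optional `rng` parameter are choices made for this sketch.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```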


Load Shedding by Sampling for Continuous Aggregate Queries on Data Streams:

Only random samples are available for computing aggregate queries because of limitations of remote sensors or transmission lines

Load shedding policies are implemented when overloads occur

When overloads occur (e.g., due to a burst of arrivals) we can

1. drop queries altogether, or

2. sample the input, which is much preferable

Key objective: Achieve answer accuracy with sparse samples for complex aggregates on windows

Can we improve answer accuracy with minimal overhead?


Load Shedding

To cope with bursty arrivals of high-volume data, a DSMS has to shed load while minimizing the degradation of the Quality of Service (QoS)

The goal then becomes determining when, where and how much load to shed

An intelligent scheme can improve the quality of our mining results under bursty arrivals


A first Architecture

Basic Idea [BDM04]: optimize sampling rates of load shedders for accurate answers.

Find an error bound for each aggregate query.

Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.

This approach works for SUM and COUNT

Generalization to other functions?
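
As a rough illustration of why SUM and COUNT are tractable, consider the standard scaled estimator under independent sampling, where every tuple is kept with the same probability P. The notation (x_i, B_i, \hat{S}) is introduced here for illustration only; the error bounds actually used in [BDM04] may be derived differently.

```latex
% Each of the N tuples x_i is kept independently with probability P.
\hat{S} = \frac{1}{P} \sum_{i=1}^{N} x_i B_i ,
\qquad B_i \sim \mathrm{Bernoulli}(P) \ \text{independent}

% The scaled sum is unbiased for the true sum S = \sum_i x_i:
\mathbb{E}[\hat{S}] = \frac{1}{P} \sum_{i=1}^{N} x_i \, P = S

% Its variance shrinks as the sampling rate P grows:
\mathrm{Var}[\hat{S}] = \frac{1}{P^{2}} \sum_{i=1}^{N} x_i^{2} \, P (1 - P)
                      = \frac{1 - P}{P} \sum_{i=1}^{N} x_i^{2}
```

COUNT is the special case x_i = 1, giving variance N(1 - P)/P. Because the error depends on P in this simple reciprocal way, sampling rates can be optimized against a resource budget; aggregates without such a closed form are what the question about generalization refers to.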

[Figure: architecture diagram showing data streams S1 … Sn, load shedders, a query network of query operators, and aggregates (∑).]


Query Network: arbitrary placement of aggregates, and a shedder after any aggregate

[Figure: query network over data streams S1 … Sn, with load shedders L1, L2, L4, L5 and aggregate operators Q1, Q2, Q3, Q4, Q5.]


Generalized Load Shedding in Stream Mill

1. A general framework that achieves optimal load shedding policies, while accommodating different requirements for different users, different query sensitivities, and different penalties.

2. Applicability to a wide spectrum of aggregate functions, which we have formally characterized using a new notion called reciprocal-error queries.

3. An extensible architecture that allows UDAs to benefit from the system-provided load shedding functions.

4. Significant improvements (in absolute error, false positives, and false negatives) compared to the common uniform approach.

5. An efficient (linear-time) algorithm to handle severe overloads without losing optimality.


Goals to Achieve

Light-weight overhead handling

– React to overload immediately

Minimizing QoS degradation

Delivering subset results

– Only omitting tuples from the correct answer

– Never producing incorrect answers


Prediction & Improvements

A larger class of queries was considered in [LZ08]: SUM, COUNT, AVG, quantiles.

Temporal correlation between answers can be used to improve the answer

– Example: sensor data

– The current answer can be adjusted by the past answers so that:

– Low sampling rate: current answer less accurate, more dependent on history.

– High sampling rate: current answer more accurate, less dependent on history.

A Bayesian quality enhancement module can achieve this objective automatically and reduce the uncertainty of the approximate answers.
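
The dependence on history can be illustrated with a textbook Gaussian update, in which the history acts as a prior and the sampled answer as a noisy observation. This is only a sketch under that Gaussian assumption; it is not the model of [LZ08], and the function name and parameters are hypothetical.

```python
def combine_with_history(observed, observed_var, prior_mean, prior_var):
    """Posterior mean and variance for a Gaussian prior and a Gaussian observation.

    observed_var is large when the sampling rate is low, so the weight on the
    current answer shrinks and the result leans on the history (the prior).
    """
    weight = prior_var / (prior_var + observed_var)  # weight on the current answer
    post_mean = prior_mean + weight * (observed - prior_mean)
    post_var = prior_var * observed_var / (prior_var + observed_var)
    return post_mean, post_var


# High sampling rate: small observation variance, answer moves halfway to 110.
high_rate = combine_with_history(110.0, 4.0, prior_mean=100.0, prior_var=4.0)
# Low sampling rate: large observation variance, answer stays close to history.
low_rate = combine_with_history(110.0, 36.0, prior_mean=100.0, prior_var=4.0)
```

In both cases the posterior variance is smaller than either input variance, which corresponds to the reduced uncertainty mentioned above.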


Improved Model Using History

The observed answer à is computed from random samples of the complete stream with sampling rate P.

A Bayesian method obtains the improved answer by combining

– the observed answer

– the error model

– the history of the answer

[Figure: architecture diagram showing data streams S1 … Sn, load shedders, a query network of query operators, aggregates (∑), and a Quality Enhancement Module that takes Ã, P, and the history and produces the improved answer.]


Summary

An error model

– Works for order statistics and data mining functions as well as with traditional aggregates, computationally very efficient

Bayesian quality enhancement method for approximate aggregates in the presence of sampling.

No correction is applied when concept changes are suspected; a two-sample test is used to detect suspected changes.
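
The slides do not say which two-sample test is used. As one possible choice, the sketch below uses the two-sample Kolmogorov-Smirnov statistic with its large-sample critical value to compare recent answers against older ones; the function names and the default significance level are assumptions made for this illustration.

```python
import math


def ks_two_sample(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def change_suspected(old, recent, alpha=0.05):
    """True if the two samples look like they come from different distributions."""
    n, m = len(old), len(recent)
    # Large-sample critical value of the two-sample KS test.
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_two_sample(old, recent) > critical


old_answers = [float(v) for v in range(50)]
shifted = [v + 100.0 for v in old_answers]
```

When `change_suspected` returns true, the history no longer describes the current stream, so an enhancement step that leans on the history would be skipped.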


References: Sampling and load shedding

[Tatbul03] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik, Mitch Cherniack, Michael Stonebraker: Load Shedding in a Data Stream Manager. VLDB 2003: 309-320.

[BDM04] Brian Babcock, Mayur Datar, Rajeev Motwani: Load Shedding for Aggregation Queries over Data Streams. ICDE 2004: 350-361.

[Tatbul07] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik: Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing. VLDB 2007: 159-170.

[LZ08] Yan-Nei Law, Carlo Zaniolo: Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding. International Journal of Business Intelligence and Data Mining, 2008.

[MZ10] Barzan Mozafari, Carlo Zaniolo: Optimal Load Shedding with Aggregates and Mining Queries. In Proceedings of the 26th International Conference on Data Engineering (ICDE 2010), Long Beach, California, USA, March 1-6, 2010.