Eddies: Continuously Adaptive Query processing R. Avnur, J.M. Hellerstein UCB ACM Sigmod 2000

Post on 31-Dec-2015

Eddies: Continuously Adaptive Query processing

R. Avnur, J.M. Hellerstein

UCB

ACM Sigmod 2000

Problem Statement

• Context: large federated and shared-nothing databases

• Problem: assumptions made at query optimization rarely hold during execution

• Hypothesis: do away with traditional optimizers and solve it through adaptation

• Focus: scheduling in a tuple-based pipeline query execution model

Problem Statement Refinement

• Large-scale systems are unpredictable because of

– Hardware and workload complexity

• Bursty servers and networks, heterogeneity, hardware characteristics

– Data complexity

• Federated databases often come without proper statistical summaries

– User interface complexity

• Online aggregation may involve user ‘control’

Research Laboratory setting

• Telegraph, a system designed to query all data available online

• River, a low level distributed record management system for shared-nothing databases

• Eddies, a scheduler for dispatching work over operators in a query graph

The Idea

• Relational algebra operators consume a stream from multiple sources to produce a new stream

• A priori you don’t know how selective operators are or how fast tuples are consumed and produced

• You have to adapt continuously and learn this information on the fly

• Adapt the order of processing based on these lessons

The Idea

[Figure: a query plan of three JOIN operators connected by next calls]
The Idea

• Standard method: derive a spanning tree over the query graph

• Pre-optimize a query plan to determine operator pairs and their algorithm, e.g. to exploit access paths

• Re-optimizing a query pipeline on the fly requires careful state management, coupled with

– Synchronization barriers

• Operators have widely differing arrival rates for their operands

• This limits concurrency, e.g. in the merge-join algorithm

– Moments of symmetry

• The algorithm provides an option to exchange the roles of the operands without too many complications

• E.g. switching the roles of R and S in a nested-loop join

Nested-loop

[Figure: nested-loop join over the relations R and S]
Join and sorting

• Index-joins are asymmetric: the roles of their operands cannot easily be exchanged

– Combine index-join + operands as a unit in the process

• Sorting requires look-ahead

– Merge-joins are combined into a unit

• Ripple joins

– Break the space into smaller pieces and solve the join operation for each piece individually

– The piece crossings are moments of symmetry
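
A minimal Python sketch of the ripple-join idea above, assuming a "square" variant that draws one tuple from each input per step. The function name, the `predicate` argument, and the in-memory lists are illustrative assumptions, not the authors' implementation. After each step every pair seen so far has been compared exactly once, so a step boundary is a point where the roles of the two inputs could be exchanged.

```python
from itertools import zip_longest

def ripple_join(r_stream, s_stream, predicate):
    """Yield matching (r, s) pairs while alternating between the two inputs."""
    seen_r, seen_s = [], []
    missing = object()  # sentinel for an exhausted input
    for r, s in zip_longest(r_stream, s_stream, fillvalue=missing):
        if r is not missing:
            # New R tuple: compare it with every S tuple seen so far.
            for old_s in seen_s:
                if predicate(r, old_s):
                    yield (r, old_s)
            seen_r.append(r)
        if s is not missing:
            # New S tuple: compare it with every R tuple seen so far,
            # including the R tuple drawn in this same step.
            for old_r in seen_r:
                if predicate(old_r, s):
                    yield (old_r, s)
            seen_s.append(s)
```

Because results are produced as the "ripple" grows, a consumer can stop early and still hold a valid partial join result.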

The Idea

[Figure: an eddy with a tuple buffer routing tuples among three JOIN operators via next calls]

Rivers and Eddies

Eddies are tuple routers that distribute arriving tuples to interested operators

– What are efficient scheduling policies?

• Fixed strategy? Random? Learning?

Static eddies

• Delivery of tuples to operators can be hardwired in the eddy to reflect a traditional query execution plan

Naïve eddy

• Operators are delivered tuples based on a priority queue

• Intermediate results get highest priority to avoid buffer congestion

Observations for selections

• Extended priority queue for the operators

– Receiving a tuple leads to a credit increment

– Returning a tuple leads to a credit decrement

– Priority is determined by “weighted lottery”

• Naïve eddies exhibit back pressure in the tuple flow; production is limited by the rate of consumption at the output

• Lottery eddies approach the cost of optimal ordering, without the need to determine the order a priori

• Lottery eddies outperform heuristics

– Hash-use first, index-use first, or naïve
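
The credit scheme can be sketched in Python for the simplest case, a set of selection predicates. This is a toy model under stated assumptions, not the paper's implementation: the class name `LotteryEddy`, the one-ticket floor, and the fixed random seed are all hypothetical choices made for illustration. An operator gains a ticket when it receives a tuple and gives one back when it returns the tuple, so operators that drop many tuples accumulate tickets and tend to be tried first.

```python
import random

class LotteryEddy:
    def __init__(self, predicates, seed=0):
        self.predicates = predicates                    # operator name -> filter function
        self.tickets = {name: 1 for name in predicates}
        self.calls = {name: 0 for name in predicates}   # tuples delivered per operator
        self.rng = random.Random(seed)

    def _draw(self, candidates):
        # Weighted lottery: more tickets means a higher chance of being chosen.
        weights = [self.tickets[name] for name in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def route(self, tup):
        """Send tup through all operators in a lottery-chosen order.

        Returns True if the tuple satisfies every predicate.
        """
        pending = list(self.predicates)
        while pending:
            op = self._draw(pending)
            self.tickets[op] += 1                       # credit on receiving a tuple
            self.calls[op] += 1
            if not self.predicates[op](tup):
                return False                            # tuple dropped, ticket is kept
            # Debit on returning the tuple; keep at least one ticket (assumption).
            self.tickets[op] = max(1, self.tickets[op] - 1)
            pending.remove(op)
        return True
```

With one highly selective and one non-selective predicate, the selective operator ends up holding most tickets, so most tuples are filtered out before the other operator ever sees them.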

Observations

• The dynamics during a run can be controlled by a learning scheme

– Split the processing in steps (‘windows’) to re-adjust the weight during tuple delivery

• Initial delays cannot be handled efficiently

• Research challenges:

– Better learning algorithms to adjust flow

– Aggressive adjustments

– Remove pre-optimization

– Balance ‘hostile’ parallel environment

– Deploy eddies to control degree of partitioning (and replication)

Database streams: You only get one chance to look

Prof. Dr. Martin Kersten

CWI

Amsterdam

March 2003

The tranquil database scene

• Traditional DBMS – data stored in finite, persistent data sets, SQL-based applications to manage and access it

[Diagram: an OLTP web application, ‘ad-hoc’ reporting, and a data entry application around an RDBMS]

The tranquil database scene

• The user community grows and MANY want up-to-the-second (aggregate) information from the database

[Diagram: an OLTP web application, ‘ad-hoc’ reporting, a data entry application, and informed reporting around an RDBMS]

The tranquil database scene

• Database entry is taken over by a remote device which issues a high volume of update transactions

[Diagram: an OLTP web application, ‘ad-hoc’ reporting, two data entry applications, and informed reporting around an RDBMS]

The tranquil database scene

• Database entry is taken over by MANY remote devices which issue a high volume of update transactions

[Diagram: an OLTP web application, ‘ad-hoc’ reporting, two data entry applications, and informed reporting around an RDBMS]

The tranquil database scene

• Database solutions cannot carry the weight

[Diagram: an OLTP web application, ‘ad-hoc’ reporting, two data entry applications, and informed reporting around an RDBMS]

Application domains

• Personalized financial tickers

• Personalized information delivery

• Personalized environment control

• Business-to-business middleware

• Web-services application based on XML exchange

• Monitoring the real-world environment (pollution, traffic)

• Monitoring the data flow in an ISP

• Monitoring web-traffic behaviour

• Monitoring the load on a telecom switch

• Monitoring external news-feeds

Application domains

[Slide build: the application domains listed above are grouped under three labels: QUERYING, STREAM UPDATE, and WEB SERVICES]

Continuous queries

• Continuous query – the user observes the changes made to the database through a query

– Query registration once

– Continuously up-to-date answers

[Diagram: continuous queries registered at an RDBMS]

Data Streams

• Data streams

– The database is in constant bulk load mode

– The update rate is often non-uniform

– The entries are time-stamped

– The source could be web-service, sensor, wrapped source

[Diagram: a data entry application feeding a DSMS]

DSMS

Data Stream Management Systems (DSMS) support high volume update streams and real-time response to ad-hoc complex queries.

What can be salvaged from the DBMS core technology?

What should be re-designed from scratch?

[Diagram: a data entry application feeding a DSMS, which serves informed reporting]

DBMS versus DSMS

DBMS

• Persistent relations

• Transaction oriented

• One-time queries

• Precise query answering

• Access plan determines physical database design

DSMS

• Transient streams

• Query orientation

• Continuous queries

• Best-effort query answering

• Unpredictable data characteristics

Old technology to the rescue?

• Many stream based applications are low-volume with simple queries

– Thus we can live with automatic query ‘refresh’

• Triggers are available for notification of changes

– They are hooked up to simple changes to the datastore

– There is no technology to merge/optimize trigger groups

Outline of remainder

• Query processing over multiple streams

• Organizing hundreds of ad-hoc queries

• Sensor-network based querying

A stream application

• [Widom] Consider a network traffic system for an ISP with a customer link and a backbone link, and two streams keeping track of the IP traffic

PTc(saddr, daddr, id, length, timestamp)

PTb(saddr, daddr, id, length, timestamp)

[Diagram: the streams PTc and PTb feeding the DSMS]

A stream application

• Q1 Compute the load on the backbone link averaged over a one-minute period and notify the operator when the load exceeds a threshold T

Select notifyoperator(sum(length))
From PTb
Group By getminute(timestamp)
Having sum(length) > T

With low stream flow it could be handled with a DBMS trigger; otherwise sample the stream to get an approximate answer
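
As a rough illustration of why Q1 needs almost no state, here is a Python sketch that keeps a single running sum per minute. It assumes packets arrive in timestamp order and that timestamps are seconds; the function name and the `(timestamp, length)` tuple layout are assumptions made for the example, and the yielded pair stands in for the `notifyoperator` call.

```python
def backbone_load_alerts(packets, threshold):
    """Yield (minute, load) for every minute whose total length exceeds threshold.

    packets: iterable of (timestamp_seconds, length), in timestamp order.
    """
    current_minute, load = None, 0
    for timestamp, length in packets:
        minute = int(timestamp // 60)
        if current_minute is not None and minute != current_minute:
            # The previous minute is complete: report it if it was overloaded.
            if load > threshold:
                yield (current_minute, load)
            load = 0
        current_minute = minute
        load += length
    # Report the last (possibly still open) minute as well.
    if current_minute is not None and load > threshold:
        yield (current_minute, load)
```

Sampling the stream, as the slide suggests for high flow rates, would only change which packets reach this loop and how the sum is scaled.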

A stream application

• Q2 Find the fraction of traffic on the backbone link coming from the customer network to check cause of congestion.

( Select count(*)
  From PTc as C, PTb as B
  Where C.saddr = B.saddr and C.daddr = B.daddr and C.id = B.id ) /
( Select count(*) From PTb )

Both streams might require an unbounded resource to perform the join, which could be avoided with an approximate answer and a synopsis

A stream application

• Q3 Monitor the top 5% source-to-destination pairs in terms of traffic on the backbone.

With Load As (Select saddr, daddr, sum(length) as traffic
              From PTb Group By saddr, daddr)
Select saddr, daddr, traffic
From Load as l1
Where (Select count(*) From Load as l2
       Where l2.traffic < l1.traffic) > (Select 0.95*count(*) From Load)
Order By traffic

This query contains ‘blocking’ operators
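
To make the "blocking" remark concrete, the following Python sketch mirrors Q3 literally, including its strict comparisons. It is an illustration under assumptions: the function name, the `quantile` parameter, and the `(saddr, daddr, length)` tuple layout are not from the slides. No row can be emitted until the whole input has been grouped, which is exactly what makes the query hard to run over an unbounded stream.

```python
from collections import defaultdict

def heaviest_pairs(packets, quantile=0.95):
    """packets: iterable of (saddr, daddr, length) tuples from the backbone stream."""
    load = defaultdict(int)
    for saddr, daddr, length in packets:
        load[(saddr, daddr)] += length          # the Load view: traffic per pair
    cutoff = quantile * len(load)               # 0.95 * count(*) over Load
    rows = []
    for (saddr, daddr), traffic in load.items():
        # Count the pairs that carry strictly less traffic than this one.
        lighter = sum(1 for other in load.values() if other < traffic)
        if lighter > cutoff:
            rows.append((saddr, daddr, traffic))
    return sorted(rows, key=lambda row: row[2])  # Order By traffic
```

The quadratic inner count follows the nested subquery as written; a practical version would sort once or maintain an approximate quantile summary instead.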

STREAM architecture

[Diagram: the streams PTc and PTb enter the DSMS, which produces an answer and uses an answer store, a scratch area, and trash]

The answer store area simply needs an integer

• Q1 Compute the load on the backbone link averaged over a one-minute period and notify the operator when the load exceeds a threshold T

Select notifyoperator(sum(length))

From PTb

Group By getminute(timestamp)

Having sum(length) >T

The scratch area should maintain part of the two streams to implement the join. Or a complete list of saddr and daddr.

• Q2 Find the fraction of traffic on the backbone link coming from the customer network to check cause of congestion.

( Select count(*)

From PTc as C, PTb as B

Where C.saddr = B.saddr and C.daddr=B.daddr

and C.id=B.id ) /

( Select count(*) From PTb)
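
One way to keep the scratch area bounded is to retain only recent customer packets. The Python sketch below is an approximation under stated assumptions, not the STREAM design: it assumes a single merged, time-ordered event list, a hypothetical `window` in seconds, and that a backbone packet only needs to match a customer packet seen within that window. Each backbone packet is counted at most once.

```python
from collections import deque

def customer_fraction(events, window=10.0):
    """Estimate the fraction of backbone packets that came from the customer link.

    events: (link, saddr, daddr, id, timestamp) in timestamp order,
            where link is "c" (customer) or "b" (backbone).
    """
    recent = deque()      # (timestamp, key) of customer packets, oldest first
    recent_keys = {}      # key -> number of copies currently in the window
    matched = total = 0
    for link, saddr, daddr, pid, ts in events:
        # Evict customer packets that have fallen out of the window.
        while recent and recent[0][0] < ts - window:
            _, old_key = recent.popleft()
            recent_keys[old_key] -= 1
            if recent_keys[old_key] == 0:
                del recent_keys[old_key]
        key = (saddr, daddr, pid)
        if link == "c":
            recent.append((ts, key))
            recent_keys[key] = recent_keys.get(key, 0) + 1
        else:
            total += 1
            if key in recent_keys:
                matched += 1
    return matched / total if total else 0.0
```

A larger window trades memory for accuracy: packets that take longer than `window` to reach the backbone are missed.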

Joining two tables

[Figure: nested loop join of RelA and RelB]

Joining two streams

[Figure: nested loop join over the unbounded streams PTa and PTb]

An unbounded store would be required

Joining two streams

[Figure: merge join over the streams PTa and PTb]

If the streams are ordered, a simple merge join is possible with limited resource requirements
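
A small Python sketch shows why ordered streams need so little memory: only one tuple per side is held at any time. It assumes both inputs are sorted on the join key and that keys are unique within each stream; the function name and the `key` argument are illustrative.

```python
from itertools import count, islice

def merge_join(left, right, key=lambda t: t):
    """Join two iterators sorted on unique keys, buffering one tuple per side."""
    left, right = iter(left), iter(right)
    done = object()                       # sentinel for an exhausted input
    l, r = next(left, done), next(right, done)
    while l is not done and r is not done:
        if key(l) < key(r):
            l = next(left, done)          # left is behind: advance it
        elif key(l) > key(r):
            r = next(right, done)         # right is behind: advance it
        else:
            yield (l, r)
            l, r = next(left, done), next(right, done)

# Works on unbounded inputs too: multiples of 2 joined with multiples of 3.
first_matches = list(islice(merge_join(count(0, 2), count(0, 3)), 3))
```

Handling duplicate keys would require buffering the current run of equal keys on one side, which is still bounded as long as runs are short.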

Joining two streams

[Figure: join synopsis over the streams PTa and PTb, with a histogram per stream and a window]

A statistical summary could provide an approximate answer
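
As one possible reading of "join synopsis", the Python sketch below estimates an equi-join size from two equi-width histograms. It assumes integer join keys and a uniform spread of values inside each bucket; the function names and the `bucket_width` parameter are assumptions for illustration, not a description of any particular system.

```python
from collections import Counter

def histogram(values, bucket_width):
    """Equi-width histogram: bucket number -> number of values in that bucket."""
    return Counter(v // bucket_width for v in values)

def estimate_join_size(hist_a, hist_b, bucket_width):
    """Estimate the number of matching pairs from the two histograms alone.

    Uniformity assumption: the tuples of a bucket are spread evenly over its
    bucket_width distinct key values, so a bucket contributes
    count_a * count_b / bucket_width matching pairs.
    """
    return sum(
        hist_a[b] * hist_b[b] / bucket_width
        for b in hist_a
        if b in hist_b
    )
```

The histograms are small and can be maintained incrementally per window, so the original tuples can be discarded once counted.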

• Q3 Monitor the top 5% source-to-destination pairs in terms of traffic on the backbone.

With Load As (Select saddr, daddr, sum(length) as traffic
              From PTb Group By saddr, daddr)
Select saddr, daddr, traffic
From Load as l1
Where (Select count(*) From Load as l2
       Where l2.traffic < l1.traffic) > (Select 0.95*count(*) From Load)
Order By traffic

The scratch area should maintain part of the two streams to implement the join.

Finance

• [DeWitt] Consider a financial feed where thousands of clients can register arbitrarily complex continuous queries.

– XML stream querying

[Diagram: an XML stream feeding the DSMS]

Finance

• Q4 Notify me whenever the price of KPN stock drops below 6 euro

Select notifyUser(name, price)

From ticker t1

Where t1.name = “KPN” and t1.price < 6

Finance

• Q5 Notify me whenever the price of KPN stock drops by 5% over the last hour

Select notifyUser(name, price)

From ticker t1,t2

Where t1.name = “KPN” and t2.name= t1.name

and getminutes(t1.timestamp-t2.timestamp) <60

and t1.price < 0.95 * t2.price
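
The self-join in this query can be evaluated incrementally with a sliding window over the ticker. The Python sketch below is a simplified reading with assumptions labelled: ticks arrive in timestamp order, timestamps are whole minutes, only earlier ticks count as t2, and each new tick is reported at most once. The function name and parameters are hypothetical.

```python
from collections import deque

def price_drop_alerts(ticks, name="KPN", window_minutes=60, drop=0.05):
    """Yield (name, price, minute) when a price is more than `drop` below an
    earlier price of the same stock seen within the last window_minutes.

    ticks: iterable of (name, price, minute), in timestamp order.
    """
    window = deque()    # (minute, price) of recent ticks for `name`
    for tick_name, price, minute in ticks:
        if tick_name != name:
            continue
        # Discard ticks that are an hour old or older.
        while window and minute - window[0][0] >= window_minutes:
            window.popleft()
        if any(price < (1 - drop) * old_price for _, old_price in window):
            yield (tick_name, price, minute)
        window.append((minute, price))
```

Only the ticks of the last hour for one stock are stored, which is the bounded "stream window" that a timer-based query calls for.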

Finance

• Q6 Notify me whenever the price of KPN stock drops by 5% over the last hour and T-mobile remains constant

Select notifyUser(name, price)

From ticker t1,t2, t3,t4

Where t1.name = “KPN” and t2.name= t1.name

and getminutes(t1.timestamp-t2.timestamp) <60

and t1.price < 0.95 * t2.price

and t1.timestamp=t3.timestamp and t2.timestamp=t4.timestamp

and t3.name = “T-Mobile” and t4.name= t3.name

and getminutes(t3.timestamp-t4.timestamp) <60

and t3.price = t4.price

Query signatures

• Traditional SQL applications already use the notion of parameterised queries, i.e. some constants are replaced by a program variable.

– Subsequent calls use the same query evaluation plan

• In a DSMS we should recognize such queries as quickly as possible

– Organize similar queries into a group

– Decompose complex queries into smaller queries

– Manage the amount of intermediate store

Finance

• Queries can be organized in groups using a signature, and their evaluation can be replaced by a single multi-user request.

Select notifyUser(name, price)

From ticker t1

Where t1.name = “KPN” and t1.price < 6

Client        Name  Threshold Price
192.871.12.1  KPN   6
192.777.021   ING   12

Finance

• Queries can be organized in groups using a signature, and their evaluation can be replaced by a single multi-user request.

Select notifyUser(c.client, t1.name, t1.price)

From ticker t1, clients c

Where t1.name = c.name and t1.price < c.price

Client        Name  Threshold Price
192.871.12.1  KPN   6
192.777.021   ING   12
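
The grouped form of the query amounts to joining each tick against a table of registered constants. A minimal Python sketch, with made-up client identifiers and hypothetical function names, shows the idea: all queries that share the signature are answered by one lookup per tick instead of one query evaluation per client.

```python
from collections import defaultdict

def build_index(clients):
    """clients: iterable of (client, name, threshold_price) registrations."""
    index = defaultdict(list)
    for client, name, threshold in clients:
        index[name].append((threshold, client))
    return index

def notify(tick, index):
    """Return (client, name, price) for every registration the tick satisfies."""
    name, price = tick
    return [
        (client, name, price)
        for threshold, client in index.get(name, [])
        if price < threshold
    ]
```

Registering a new client query with the same signature is then just an insert into the index, with no new plan to optimize.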

Finance

• Timer-based queries call for a stream window with incremental evaluation

• Multiple requests can be organized by time-table and event detection methods provided by database triggers.

Select notifyUser(name, price)

From ticker t1,t2

Where t1.name = “KPN” and t2.name= t1.name

and getminutes(t1.timestamp-t2.timestamp) <60

and t1.price < 0.95 * t2.price

Finance

• Complex queries can be broken down into independent components

Select notifyUser(name, price)

From ticker t1,t2, t3,t4

Where t1.name = “KPN” and t2.name= t1.name

and getminutes(t1.timestamp-t2.timestamp) <60

and t1.price < 0.95 * t2.price

and t1.timestamp=t3.timestamp and t2.timestamp=t4.timestamp

and t3.name = “T-Mobile” and t4.name= t3.name

and getminutes(t3.timestamp-t4.timestamp) <60

and t3.price = t4.price

Finance

• Intermediate results should be materialized. They can be integrated into traditional query evaluation schemes

t1.timestamp=t3.timestamp and t2.timestamp=t4.timestamp

Sensor networks

• [Madden] Sensor networks are composed of thousands of small devices, interconnected through radio links. This network can be queried.

– Sensors have limited energy

– Sensors have limited reachability

– Sensors can be ‘crushed’

Aggregate Queries Over Ad-Hoc Wireless Sensor Networks

Sensor networks

• Q7 Give me the traffic density on the A1 for the last hour

Select avg(t.car)

From traffic t

Where t.segment in (Select segment From roads

Where name = “A1”)

Group By gethour(t.timestamp)

Sensor networks

• The sensors should organize themselves into a P2P infrastructure

• An aggregate query is broadcast through the network

• Each Mote calculates a partial answer and sends it to its peers

• Peers aggregate the information to produce the final answer.

• Problems

– The energy to broadcast some information is high

– Tuples and partial results may be dropped
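
The partial-answer scheme can be sketched in Python for an average, which is not directly mergeable but becomes so when each mote forwards a (sum, count) pair. The routing tree, the node names, and the function names are assumptions for illustration; a reading that is missing from `readings` models a dropped tuple.

```python
def merge(a, b):
    """Combine two partial states, each a (sum, count) pair."""
    return (a[0] + b[0], a[1] + b[1])

def aggregate(node, readings, children):
    """Combine a mote's own reading with the partial states of its children.

    readings: node -> sensed value (a missing entry models a lost reading)
    children: node -> list of child nodes in the routing tree
    """
    partial = (readings[node], 1) if node in readings else (0, 0)
    for child in children.get(node, []):
        partial = merge(partial, aggregate(child, readings, children))
    return partial
```

Each mote sends one small pair upstream regardless of how many descendants it has, which keeps the radio cost per node constant; the root divides sum by count to obtain the average.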

Conclusions and outlook

• Data stream management technology requires changes in our expectations of DBMS functionality

– Queries do not necessarily provide a precise answer

– Queries continue as long as we are interested in their approximate result

– The persistent store does not necessarily contain a consistent and timeless view of the state of the database.

Conclusions and outlook

• Data stream management technology capitalizes upon proven DBMS technology

• DSMSs provide a basis for ambient home settings, sensor networks, and globe-spanning information systems

• It is realistic to expect that some of the properties to support efficient datastream management will become part of the major products

– Multi query optimization techniques should be added.

Literature

• NiagaraCQ: A Scalable Continuous Query System for Internet Databases, J. Chen, D.J. DeWitt, F. Tian, Y. Wang, Wisconsin Univ.

• Streaming Queries over Streaming Data, Sirish Chandrasekaran, Michael J. Franklin, Univ Berkeley

• Continuous Queries over Data Streams, S. Babu, J. Widom, Stanford University