Stream Processing Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 30, 2005


Stream Processing

Zachary G. Ives

University of Pennsylvania

CIS 650 – Database & Information Systems

March 30, 2005


Administrivia

Thursday, L101, 3PM: Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems

Next readings:
  Monday – read and review the Madden paper
  Wednesday – read and summarize the Brin and Page paper


Today’s Trivia Question


Data Stream Management

Basic idea: static queries, dynamic data

Applications:
  Publish-subscribe systems
  Stock tickers, news headlines
  Data acquisition, e.g., from sensors, traffic monitoring, …

The two main projects that are purely “stream processors”:
  Stanford STREAM
  MIT/Brown/Brandeis Aurora/Medusa


Summary from Last Time

Streams are time-varying data series
  STREAM maps them into timestamped sets (Aurora doesn’t seem to do this)

Most operations on streams resemble normal DB queries:
  Filtering, projection; grouping and aggregation; join
  (Though the latter few are over windows)

STREAM started with an SQL-like language called CQL
  All stream operations go “through” relations
  Query plan operators have queues and synopses
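
One way to picture stream operations going “through” relations: a window turns the stream into a relation at each timestamp, an ordinary relational operator runs on that relation, and a relation-to-stream operator emits what changed. The sketch below is only an illustration under assumptions: the names `row_window`, `select`, and `istream` and the sample data are made up here, and sets stand in for the bag semantics a real system would likely need.

```python
from collections import deque

def row_window(stream, n):
    """Stream-to-relation: at each timestamp, the relation is the last n tuples."""
    window = deque(maxlen=n)
    for ts, value in stream:
        window.append(value)
        yield ts, list(window)

def select(relations, predicate):
    """Relation-to-relation: an ordinary filter applied to each snapshot."""
    for ts, rel in relations:
        yield ts, [v for v in rel if predicate(v)]

def istream(relations):
    """Relation-to-stream: emit tuples present now but absent from the previous snapshot."""
    previous = set()
    for ts, rel in relations:
        current = set(rel)
        for v in sorted(current - previous):
            yield ts, v
        previous = current

# Hypothetical input: (timestamp, value) pairs.
stream = [(1, 5), (2, 12), (3, 7), (4, 20)]
result = list(istream(select(row_window(stream, 2), lambda v: v > 6)))
print(result)
```

The filter never sees the stream directly; it only sees window snapshots, which is the point of the relation-centric design.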


Some Tricks for Performance

Sharing synopses across multiple operators
  In a few cases, more than one operator may join with the same synopsis

Can exploit punctuations or “k-constraints”
  Analogous to interesting orders
  Referential integrity k-constraint: bound of k between arrival of “many” element and its corresponding “one” element
  Ordered-arrival k-constraint: need window of at most k to sort
  Clustered-arrival k-constraint: bound on distance between items with same grouping attributes
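
To make the ordered-arrival case concrete, here is a minimal sketch of sorting with bounded state. It assumes the constraint means that any element arriving more than k positions after another has a timestamp at least as large; the function name and sample data are hypothetical.

```python
import heapq

def sort_with_k_constraint(stream, k):
    """Emit items in order while buffering at most k + 1 of them.

    Assumes an ordered-arrival constraint: an item arriving more than k
    positions after another never sorts before it.
    """
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        if len(heap) > k:
            # The smallest buffered item can no longer be preceded by a later arrival.
            yield heapq.heappop(heap)
    while heap:
        # End of input: drain the buffer in order.
        yield heapq.heappop(heap)

# Hypothetical (timestamp, payload) tuples, each at most one position out of order.
arrivals = [(2, 'b'), (1, 'a'), (3, 'c'), (5, 'e'), (4, 'd'), (6, 'f')]
ordered = list(sort_with_k_constraint(arrivals, k=1))
print(ordered)
```

Without the constraint the operator would have to buffer the whole stream before emitting anything, which is why the bound matters for synopsis size.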


Query Processing – “Chain Scheduling”

Similar in many ways to eddies

Combination of locally greedy and FIFO scheduling

Apply operator to data as follows:
  Assume we know how many tuples can be processed in a time unit
  Cluster groups of operators into “chains” that maximize reduction in queue size per unit time (i.e., most selective operators per time unit)
  Greedily forward tuples into the most selective chain
  Within a chain, process the data in FIFO order

STREAM also does a form of join reordering
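
A rough sketch of how the chains might be formed for a single operator pipeline. Each operator is given an assumed per-tuple cost (which must be positive) and selectivity. Starting from the current operator, the chain extends to whichever later point gives the largest reduction in remaining data per unit time. The function names, the tie-breaking rule, and the numbers are illustrative assumptions, not the STREAM implementation.

```python
def build_chains(ops):
    """ops: list of (cost, selectivity) per operator, in pipeline order.

    Returns a list of (operator_indices, steepness), where steepness is the
    fraction of data removed per unit of processing time along the chain.
    """
    # Progress chart: cumulative time vs. fraction of the input still remaining.
    points = [(0.0, 1.0)]
    for cost, selectivity in ops:
        t, s = points[-1]
        points.append((t + cost, s * selectivity))

    chains = []
    start = 0
    while start < len(ops):
        t0, s0 = points[start]
        best_end, best_steepness = None, None
        for end in range(start + 1, len(points)):
            t1, s1 = points[end]
            steepness = (s0 - s1) / (t1 - t0)
            # Ties extend the chain (a choice made for this sketch).
            if best_steepness is None or steepness >= best_steepness:
                best_end, best_steepness = end, steepness
        chains.append((list(range(start, best_end)), best_steepness))
        start = best_end
    return chains

def pick_next(chains, queue_lengths):
    """Greedy step: among chains with queued input, run the steepest one."""
    ready = [i for i, n in enumerate(queue_lengths) if n > 0]
    return max(ready, key=lambda i: chains[i][1]) if ready else None

# Hypothetical pipeline: a weak filter, a strong filter, then a moderate one.
chains = build_chains([(1.0, 0.9), (1.0, 0.1), (1.0, 0.5)])
print(chains)
print(pick_next(chains, [3, 5]))
```

The weak first filter is grouped with the strong second one because running both removes more data per unit time than running the first alone. Within the chosen chain, tuples would then be processed in FIFO order.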


Scratching the Surface: Approximation

They point out two areas where we might need to approximate output:
  CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    Collect statistics via a profiler
    Use Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval
    This is generally termed load shedding
  May need to do similar things if memory usage is a constraint

Are there other options? When might they be useful?
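
As a worked illustration of the Hoeffding step: for a windowed average over values bounded in a range of width R, the standard two-sided Hoeffding bound says n >= R^2 ln(2/delta) / (2 epsilon^2) samples keep the estimate within epsilon of the true mean with probability at least 1 - delta. The sketch below turns that into a sampling rate; the function names and numbers are assumptions, and STREAM's actual derivation may differ in detail.

```python
import math

def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Samples needed so the sample mean is within epsilon of the true mean
    with probability >= 1 - delta (two-sided Hoeffding bound)."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def sampling_rate(tuples_per_window, epsilon, delta, value_range=1.0):
    """Fraction of a window's tuples to keep; never more than all of them."""
    needed = hoeffding_sample_size(epsilon, delta, value_range)
    return min(1.0, needed / tuples_per_window)

# Hypothetical target: error of at most 0.05 with 95% confidence on values in [0, 1].
n = hoeffding_sample_size(epsilon=0.05, delta=0.05)
rate = sampling_rate(tuples_per_window=10_000, epsilon=0.05, delta=0.05)
print(n, rate)
```

The required sample size does not depend on the window size, so the higher the arrival rate, the larger the fraction of tuples that can be shed for the same accuracy.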


STREAM in General

“Logical semantics first”

Starts with a basic data model: streams as timestamped sets

Develops a language and semantics
  Heavily based on SQL

Proposes a relatively straightforward implementation
  Interesting ideas like k-constraints
  Interesting approaches like chain scheduling
  No real consideration of distributed processing


Aurora

“Implementation first; mix and match operations from past literature”

Basic philosophy: most of the ideas in streams existed in previous research
  Sliding windows, load shedding, approximation, …
  So let’s borrow those ideas and focus on how to build a real system with them!

Emphasis is on building a scalable, robust system

Distributed implementation: Medusa


Queries in Aurora

Oddly: no declarative query language!

Queries are workflows of physical query operators (SQuAl)
  Many operators resemble relational algebra ops


Example Query


Some Interesting Aspects

A relatively simple adaptive query optimizer
  Can push filtering and mapping into many operators
  Can reorder some operators (e.g., joins, unions)

Need built-in error handling
  If a data source fails to respond in a certain amount of time, create a special alarm tuple
  This propagates through the query plan

Incorporate built-in load shedding and real-time scheduling to support QoS

Have a notion of combining a query over historical data with data from a stream
  Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)
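
The alarm-tuple idea can be sketched as follows. This is a replay over recorded arrival times, not Aurora's mechanism: a live system would presumably fire the alarm from a timer instead of waiting for the next tuple. The `Alarm` class, the function names, and the feed data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alarm:
    """Special tuple marking that a source was silent for too long."""
    source: str
    last_seen: float

def with_timeout_alarms(arrivals, source, timeout):
    """arrivals: iterable of (arrival_time, value). Inserts an Alarm wherever
    the gap between consecutive arrivals exceeds timeout."""
    last = None
    for ts, value in arrivals:
        if last is not None and ts - last > timeout:
            yield Alarm(source, last)
        yield value
        last = ts

def filter_op(items, predicate):
    """A downstream operator: filters data tuples but always forwards alarms."""
    for item in items:
        if isinstance(item, Alarm) or predicate(item):
            yield item

# Hypothetical feed with a 4-second gap between the second and third readings.
feed = [(0.0, 10), (1.0, 3), (5.0, 12)]
out = list(filter_op(with_timeout_alarms(feed, 'feed-A', timeout=2.0), lambda v: v > 5))
print(out)
```

Because every operator forwards alarms untouched, the failure notice reaches the end of the plan even when the data around it is filtered out.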


The Medusa Processor

Distributed coordinator between many Aurora nodes
  Scalability through federation and distribution
  Fail-over
  Load balancing


Main Components

Lookup
  Distributed catalog – schemas, where to find streams, where to find queries

Brain
  Query setup, load monitoring via I/O queues and stats
  Load distribution and balancing scheme is used
    Very reminiscent of Mariposa!


Load Balancing

Migration – an operator can be moved from one node to another
  Initial implementation didn’t support moving of state
    The state is simply dropped, and operator processing resumes
    Implications on semantics?
  Plans to support state migration

“Agoric system model to create incentives”
  Clients pay nodes for processing queries
  Nodes pay each other to handle load – pairwise contracts negotiated offline
  Bounded-price mechanism – price for migration of load, spec for what a node will take on

Does this address the weaknesses of the Mariposa model?


Some Applications They Tried

Financial services (stock ticker)
  Main issue is not volume, but problems with feeds
  Two-level alarm system, where higher-level alarm helps diagnose problems
  Shared computation among queries
  User-defined aggregation and mapping

Linear road (sensor monitoring)
  Traffic sensors in a toll road – change toll depending on how many cars are on the road
  Combination of historical and continuous queries

Environmental monitoring
  Sliding-window calculations


The Big Application?

Military battalion monitoring
  Positions & images of friends and foes
  Load shedding is important
  Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
    Based on a QoS utility function
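
The contrast between random and semantic dropping can be sketched as below. The utility function (closer objects matter more) and the readings are invented for illustration, and this is not Aurora's load shedder.

```python
import random

def random_shed(tuples, keep_fraction, rng):
    """Random load shedding: keep each tuple independently with the given probability."""
    return [t for t in tuples if rng.random() < keep_fraction]

def semantic_shed(tuples, keep_fraction, utility):
    """Semantic load shedding: keep the highest-utility tuples, in arrival order."""
    keep = int(len(tuples) * keep_fraction)
    ranked = sorted(range(len(tuples)), key=lambda i: utility(tuples[i]), reverse=True)
    kept_positions = sorted(ranked[:keep])
    return [tuples[i] for i in kept_positions]

# Hypothetical readings: (kind, distance in meters).
readings = [('friend', 900), ('foe', 50), ('friend', 20), ('foe', 700)]

# Assumed QoS utility: nearby objects are worth more than distant ones.
def closeness(reading):
    return 1.0 / reading[1]

kept_semantic = semantic_shed(readings, keep_fraction=0.5, utility=closeness)
kept_random = random_shed(readings, keep_fraction=0.5, rng=random.Random(0))
print(kept_semantic)
print(kept_random)
```

Both approaches shed about the same volume, but only the semantic one guarantees that the tuples the application values most survive.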


Lessons Learned

Historical data is important – not just stream data
  (Summaries?)

Sometimes need synchronization for consistency
  “ACID for streams”?

Streams can be out of order, bursty
  “Stream cleaning”?

Adaptors and XML are important
  … But we already knew that!

Performance is critical
  They spent a great deal of time using microbenchmarks and optimizing


Borealis

Aurora is now commercial

Borealis follows up with some new directions:
  Dynamic revision of results, i.e., corrections to stream data
  Dynamic query modification – change on the fly
    “Control lines”: change parameters
    “Time travel”: support execution of multiple queries, starting from different points in time (past thru future)
  Distributed optimization
    Combine stream and sensor processing ideas (we’ll talk about sensor nets next time)
    Sensor-heavy vs. server-heavy optimization


Streams and Integration

How do streams and data integration relate?

Are streams the future, or just an interesting vista point on the side of the road?