Stream Processing
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems
March 30, 2005
Administrivia

Thursday, L101, 3PM: Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems

Next readings:
- Monday: read and review the Madden paper
- Wednesday: read and summarize the Brin and Page paper
Today’s Trivia Question
Data Stream Management

Basic idea: static queries, dynamic data

Applications:
- Publish-subscribe systems
- Stock tickers, news headlines
- Data acquisition, e.g., from sensors, traffic monitoring, …

The two main projects that are purely “stream processors”:
- Stanford STREAM
- MIT/Brown/Brandeis Aurora/Medusa
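The “static queries, dynamic data” idea can be sketched as a standing predicate registered once and then evaluated against each tuple as it arrives (a minimal illustration; the class and field names are made up, not from STREAM or Aurora):

```python
# Minimal sketch of "static query, dynamic data": the query is fixed
# up front, while data is pushed in over time.

class ContinuousQuery:
    def __init__(self, predicate):
        self.predicate = predicate      # the static, standing query
        self.results = []

    def on_arrival(self, tuple_):       # called for each dynamic tuple
        if self.predicate(tuple_):
            self.results.append(tuple_)

# Standing query over a stock ticker: IBM trades above $90.
q = ContinuousQuery(lambda t: t["sym"] == "IBM" and t["price"] > 90)
for tick in [{"sym": "IBM", "price": 89},
             {"sym": "IBM", "price": 92},
             {"sym": "MSFT", "price": 95}]:
    q.on_arrival(tick)
# q.results now holds only the one matching tick
```

This inverts the usual database pattern: in a DBMS the data sits still and queries arrive; here the query sits still and data arrives.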
Summary from Last Time

Streams are time-varying data series
- STREAM maps them into timestamped sets (Aurora doesn’t seem to do this)

Most operations on streams resemble normal DB queries:
- Filtering, projection; grouping and aggregation; join
- (Though the latter few are over windows)

STREAM started with an SQL-like language called CQL
- All stream operations go “through” relations
- Query plan operators have queues and synopses
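The windowed flavor of these operations can be sketched as a time-based sliding-window aggregate, in the spirit of CQL’s stream-to-relation windows (a toy illustration under assumed semantics, not STREAM’s implementation):

```python
from collections import deque

# Sketch of a time-based sliding-window average over a timestamped
# stream. The deque plays the role of the operator's synopsis.

class SlidingAvg:
    def __init__(self, width):
        self.width = width
        self.window = deque()           # (timestamp, value) pairs

    def insert(self, ts, value):
        self.window.append((ts, value))
        # expire elements at or before ts - width, keeping (ts-width, ts]
        while self.window and self.window[0][0] <= ts - self.width:
            self.window.popleft()
        vals = [v for _, v in self.window]
        return sum(vals) / len(vals)

w = SlidingAvg(width=10)
w.insert(1, 4.0)
w.insert(5, 8.0)
assert w.insert(12, 6.0) == 7.0   # the element at ts=1 has expired
```

Each insertion both updates the synopsis and emits the current window aggregate, which is the push-based pattern stream operators follow.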
Some Tricks for Performance

Sharing synopses across multiple operators
- In a few cases, more than one operator may join with the same synopsis

Can exploit punctuations or “k-constraints”
- Analogous to interesting orders
- Referential integrity k-constraint: bound of k between the arrival of a “many” element and its corresponding “one” element
- Ordered-arrival k-constraint: need a window of at most k to sort
- Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
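The ordered-arrival case can be sketched concretely: if every element arrives within k positions of its sorted position, a buffer of k+1 elements suffices to emit the stream in order (a standard nearly-sorted technique used here for illustration; the bound k is assumed known, e.g., from a profiler):

```python
import heapq

# Sketch of exploiting an ordered-arrival k-constraint: a min-heap
# bounded at k+1 elements restores sorted order without buffering
# the whole stream.

def k_sorted_output(stream, k):
    heap, out = [], []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:               # heap minimum is now safe to emit
            out.append(heapq.heappop(heap))
    while heap:                         # drain the remaining buffer
        out.append(heapq.heappop(heap))
    return out

# Each element is at most 2 positions away from its sorted position:
assert k_sorted_output([3, 1, 2, 5, 4], k=2) == [1, 2, 3, 4, 5]
```

This is why the constraint matters for memory: the synopsis needed is O(k), independent of stream length.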
Query Processing – “Chain Scheduling”

Similar in many ways to eddies
- Combination of locally greedy and FIFO scheduling

Apply operators to data as follows:
- Assume we know how many tuples can be processed in a time unit
- Cluster groups of operators into “chains” that maximize the reduction in queue size per unit time (i.e., the most selective operators per time unit)
- Greedily forward tuples into the most selective chain
- Within a chain, process the data in FIFO order

STREAM also does a form of join reordering
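The greedy step above can be sketched as scoring each chain by how much it shrinks its queue per unit of processing time and running the steepest one (the selectivity and cost numbers are made-up illustrative values, not from the paper):

```python
# Toy sketch of a Chain-style scheduling decision: pick the chain
# with the greatest queue reduction per unit time; within a chain,
# tuples would then be processed FIFO.

def pick_chain(chains):
    # chains: list of (name, queue_len, selectivity, cost_per_tuple)
    def steepness(c):
        name, qlen, sel, cost = c
        # fraction of tuples eliminated, per unit of processing time
        return (1 - sel) / cost if qlen > 0 else 0.0
    return max(chains, key=steepness)[0]

chains = [
    ("filter-chain", 100, 0.1, 1.0),   # drops 90% of tuples at cost 1
    ("join-chain",   100, 0.8, 2.0),   # drops 20% of tuples at cost 2
]
assert pick_chain(chains) == "filter-chain"
```

Running the most selective chain first keeps total queue memory low, which is the metric Chain scheduling optimizes.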
Scratching the Surface: Approximation

They point out two areas where we might need to approximate output:
- CPU is limited, and we need to drop some stream elements according to some probabilistic metric
  - Collect statistics via a profiler
  - Use the Hoeffding inequality to derive a sampling rate that maintains a confidence interval
  - This is generally termed load shedding
- May need to do similar things if memory usage is a constraint

Are there other options? When might they be useful?
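The Hoeffding step can be made concrete. For values scaled into [0, 1], the inequality gives P(|sample mean − true mean| ≥ ε) ≤ 2·exp(−2nε²), so n ≥ ln(2/δ) / (2ε²) samples keep the estimate within ε with probability 1 − δ; a load shedder can then keep only that many tuples per unit time (a sketch of the calculation, not STREAM’s code):

```python
import math

# Derive a sampling rate from the Hoeffding inequality:
#   P(|mean_est - mean| >= eps) <= 2 * exp(-2 * n * eps^2)
# Solving for n gives the minimum sample size below.

def min_sample_size(eps, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def sampling_rate(arrival_rate, eps, delta):
    # fraction of arriving tuples that must be kept (capped at 1)
    return min(1.0, min_sample_size(eps, delta) / arrival_rate)

n = min_sample_size(eps=0.05, delta=0.05)   # within ±0.05, 95% confidence
assert n == 738
# At 10,000 tuples/unit time, ~7.4% of tuples suffice:
rate = sampling_rate(10_000, eps=0.05, delta=0.05)
```

Note the bound is distribution-free, which is why a simple profiler’s statistics are enough to apply it.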
STREAM in General

“Logical semantics first”

Starts with a basic data model: streams as timestamped sets

Develops a language and semantics
- Heavily based on SQL

Proposes a relatively straightforward implementation
- Interesting ideas like k-constraints
- Interesting approaches like chain scheduling
- No real consideration of distributed processing
Aurora

“Implementation first; mix and match operations from past literature”

Basic philosophy: most of the ideas in streams existed in previous research
- Sliding windows, load shedding, approximation, …
- So let’s borrow those ideas and focus on how to build a real system with them!

Emphasis is on building a scalable, robust system
- Distributed implementation: Medusa
Queries in Aurora

Oddly: no declarative query language!

Queries are workflows of physical query operators (SQuAl)
- Many operators resemble relational algebra ops
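The “boxes and arrows” style can be sketched as operators wired directly into a pipeline rather than derived from a declarative statement (the filter/map boxes echo SQuAl’s flavor, but the plumbing and field names here are illustrative):

```python
# Sketch of an Aurora-style workflow query: the user composes
# physical operator boxes explicitly; there is no parser or optimizer
# deriving the plan from SQL.

def filter_box(pred):
    return lambda tuples: [t for t in tuples if pred(t)]

def map_box(fn):
    return lambda tuples: [fn(t) for t in tuples]

def workflow(*boxes):
    def run(tuples):
        for box in boxes:           # arrows: each box feeds the next
            tuples = box(tuples)
        return tuples
    return run

q = workflow(filter_box(lambda t: t["speed"] > 60),
             map_box(lambda t: {"car": t["car"], "fine": 50}))
out = q([{"car": "A", "speed": 70}, {"car": "B", "speed": 40}])
assert out == [{"car": "A", "fine": 50}]
```

The trade-off is clear in the sketch: the user gets precise control over the plan shape, but the system loses the freedom a declarative language gives an optimizer.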
Example Query
Some Interesting Aspects

A relatively simple adaptive query optimizer
- Can push filtering and mapping into many operators
- Can reorder some operators (e.g., joins, unions)

Need built-in error handling
- If a data source fails to respond within a certain amount of time, create a special alarm tuple
- This propagates through the query plan

Incorporates built-in load shedding and real-time scheduling to support QoS

Has a notion of combining a query over historical data with data from a stream
- Switches from a pull-based mode (reading from disk) to a push-based mode (reading from the network)
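The timeout-to-alarm behavior can be sketched as follows: when a source’s latest tuple is older than its deadline, a distinguished alarm tuple is injected so downstream operators can react (the deadlines and field names are illustrative assumptions, not Aurora’s actual format):

```python
# Sketch of Aurora-style error handling: a stale source yields a
# special alarm tuple that then flows through the query plan like
# ordinary data.

def read_with_timeout(source, deadline, now):
    ts, tup = source                     # (last arrival time, tuple)
    if now - ts > deadline:
        return {"alarm": True, "source": tup["source"]}
    return tup

fresh = read_with_timeout((100, {"source": "s1", "v": 7}), deadline=5, now=103)
stale = read_with_timeout((100, {"source": "s2", "v": 9}), deadline=5, now=110)
assert fresh == {"source": "s1", "v": 7}
assert stale == {"alarm": True, "source": "s2"}
```

Propagating the alarm in-band keeps failure handling inside the same dataflow machinery as normal processing.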
The Medusa Processor

Distributed coordinator between many Aurora nodes
- Scalability through federation and distribution
- Fail-over
- Load balancing
Main Components

Lookup
- Distributed catalog: schemas, where to find streams, where to find queries

Brain
- Query setup; load monitoring via I/O queues and stats
- A load distribution and balancing scheme is used
- Very reminiscent of Mariposa!
Load Balancing

Migration: an operator can be moved from one node to another
- The initial implementation didn’t support moving of state
- The state is simply dropped, and operator processing resumes
- Implications on semantics?
- Plans to support state migration

“Agoric system model to create incentives”
- Clients pay nodes for processing queries
- Nodes pay each other to handle load via pairwise contracts negotiated offline
- Bounded-price mechanism: a price for migration of load, plus a spec for what a node will take on
- Does this address the weaknesses of the Mariposa model?
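The bounded-price idea can be sketched as a migration decision: a node sheds an operator only when it is overloaded, and only to a partner whose offline-negotiated contract price stays under the agreed bound (the decision rule and numbers are illustrative assumptions, not Medusa’s actual protocol):

```python
# Sketch of a bounded-price migration decision between Medusa-style
# nodes with pairwise contracts negotiated offline.

def choose_partner(my_load, capacity, partners, price_bound):
    if my_load <= capacity:
        return None                       # not overloaded: keep the operator
    # only partners whose contracted price is under the bound qualify
    eligible = {n: p for n, p in partners.items() if p <= price_bound}
    # among those, offload to the cheapest
    return min(eligible, key=eligible.get) if eligible else None

partners = {"node-a": 12, "node-b": 7, "node-c": 9}
assert choose_partner(120, 100, partners, price_bound=10) == "node-b"
assert choose_partner(80, 100, partners, price_bound=10) is None
```

Because prices are fixed offline, the runtime decision is a cheap local lookup, avoiding Mariposa-style online auctions on the critical path.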
Some Applications They Tried

Financial services (stock ticker)
- The main issue is not volume, but problems with feeds
- Two-level alarm system, where the higher-level alarm helps diagnose problems
- Shared computation among queries
- User-defined aggregation and mapping

Linear Road (sensor monitoring)
- Traffic sensors on a toll road; the toll changes depending on how many cars are on the road
- Combination of historical and continuous queries

Environmental monitoring
- Sliding-window calculations
The Big Application?

Military battalion monitoring
- Positions & images of friends and foes
- Load shedding is important
- Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
- Based on a QoS utility function
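The contrast between the two shedding strategies can be sketched directly: random shedding drops tuples uniformly, while semantic shedding ranks tuples by a QoS utility function and drops the low-utility ones first (the utility function and data are illustrative assumptions):

```python
import random

# Sketch contrasting random vs. semantic (utility-driven) load
# shedding when only a fraction of tuples can be processed.

def semantic_shed(tuples, utility, keep_fraction):
    ranked = sorted(tuples, key=utility, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]                 # highest-utility tuples survive

def random_shed(tuples, keep_fraction, seed=0):
    rng = random.Random(seed)
    keep = max(1, int(len(tuples) * keep_fraction))
    return rng.sample(tuples, keep)      # an arbitrary subset survives

# Hypothetical battlefield readings with a "threat" score as utility:
readings = [{"id": i, "threat": i % 5} for i in range(10)]
kept = semantic_shed(readings, utility=lambda t: t["threat"], keep_fraction=0.4)
assert all(t["threat"] == 4 for t in kept[:2])   # top threats always kept
```

Under the same keep fraction, random shedding may discard exactly the high-threat tuples the application cares about, which is the argument for predicate-based dropping.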
Lessons Learned

Historical data is important, not just stream data (Summaries?)

Sometimes need synchronization for consistency (“ACID for streams”?)

Streams can be out of order, bursty (“Stream cleaning”?)

Adaptors and XML are important … but we already knew that!

Performance is critical
- They spent a great deal of time using microbenchmarks and optimizing
Borealis

Aurora is now commercial

Borealis follows up with some new directions:
- Dynamic revision of results, i.e., corrections to stream data
- Dynamic query modification: change on the fly
  - “Control lines”: change parameters
  - “Time travel”: support execution of multiple queries, starting from different points in time (past through future)
- Distributed optimization
- Combine stream and sensor processing ideas (we’ll talk about sensor nets next time)
  - Sensor-heavy vs. server-heavy optimization
Streams and Integration
How do streams and data integration relate?
Are streams the future, or just an interesting vista point on the side of the road?