Stream Processing
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems
March 30, 2005
Administrivia

Thursday, L101, 3PM: Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems

Next readings:
- Monday: read and review the Madden paper
- Wednesday: read and summarize the Brin and Page paper
Today’s Trivia Question
Data Stream Management

Basic idea: static queries, dynamic data

Applications:
- Publish-subscribe systems
- Stock tickers, news headlines
- Data acquisition, e.g., from sensors, traffic monitoring, …

The two main projects that are purely “stream processors”:
- Stanford STREAM
- MIT/Brown/Brandeis Aurora/Medusa
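The “static queries, dynamic data” idea can be sketched as a standing predicate registered once and then evaluated against each tuple as it arrives (a minimal illustration; the class and field names are made up, not from STREAM or Aurora):

```python
# Minimal sketch of "static query, dynamic data": the query is fixed
# up front, while data is pushed in over time.

class ContinuousQuery:
    def __init__(self, predicate):
        self.predicate = predicate      # the static, standing query
        self.results = []

    def on_arrival(self, tuple_):       # called for each dynamic tuple
        if self.predicate(tuple_):
            self.results.append(tuple_)

# Standing query over a stock ticker: IBM trades above $90.
q = ContinuousQuery(lambda t: t["sym"] == "IBM" and t["price"] > 90)
for tick in [{"sym": "IBM", "price": 89},
             {"sym": "IBM", "price": 92},
             {"sym": "MSFT", "price": 95}]:
    q.on_arrival(tick)
# q.results now holds only the one matching tick
```

This inverts the usual database pattern: in a DBMS the data sits still and queries arrive; here the query sits still and data arrives.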
Summary from Last Time

Streams are time-varying data series
- STREAM maps them into timestamped sets (Aurora doesn’t seem to do this)

Most operations on streams resemble normal DB queries:
- Filtering, projection; grouping and aggregation; join
- (Though the latter few are over windows)

STREAM started with an SQL-like language called CQL
- All stream operations go “through” relations
- Query plan operators have queues and synopses
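The windowed flavor of these operations can be sketched as a time-based sliding-window aggregate, in the spirit of CQL’s stream-to-relation windows (a toy illustration under assumed semantics, not STREAM’s implementation):

```python
from collections import deque

# Sketch of a time-based sliding-window average over a timestamped
# stream. The deque plays the role of the operator's synopsis.

class SlidingAvg:
    def __init__(self, width):
        self.width = width
        self.window = deque()           # (timestamp, value) pairs

    def insert(self, ts, value):
        self.window.append((ts, value))
        # expire elements at or before ts - width, keeping (ts-width, ts]
        while self.window and self.window[0][0] <= ts - self.width:
            self.window.popleft()
        vals = [v for _, v in self.window]
        return sum(vals) / len(vals)

w = SlidingAvg(width=10)
w.insert(1, 4.0)
w.insert(5, 8.0)
assert w.insert(12, 6.0) == 7.0   # the element at ts=1 has expired
```

Each insertion both updates the synopsis and emits the current window aggregate, which is the push-based pattern stream operators follow.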
Some Tricks for Performance

Sharing synopses across multiple operators
- In a few cases, more than one operator may join with the same synopsis

Can exploit punctuations or “k-constraints”
- Analogous to interesting orders
- Referential integrity k-constraint: bound of k between the arrival of a “many” element and its corresponding “one” element
- Ordered-arrival k-constraint: need a window of at most k to sort
- Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
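The ordered-arrival case can be sketched concretely: if every element arrives within k positions of its sorted position, a buffer of k+1 elements suffices to emit the stream in order (a standard nearly-sorted technique used here for illustration; the bound k is assumed known, e.g., from a profiler):

```python
import heapq

# Sketch of exploiting an ordered-arrival k-constraint: a min-heap
# bounded at k+1 elements restores sorted order without buffering
# the whole stream.

def k_sorted_output(stream, k):
    heap, out = [], []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:               # heap minimum is now safe to emit
            out.append(heapq.heappop(heap))
    while heap:                         # drain the remaining buffer
        out.append(heapq.heappop(heap))
    return out

# Each element is at most 2 positions away from its sorted position:
assert k_sorted_output([3, 1, 2, 5, 4], k=2) == [1, 2, 3, 4, 5]
```

This is why the constraint matters for memory: the synopsis needed is O(k), independent of stream length.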
Query Processing – “Chain Scheduling”

Similar in many ways to eddies
- Combination of locally greedy and FIFO scheduling

Apply operators to data as follows:
- Assume we know how many tuples can be processed in a time unit
- Cluster groups of operators into “chains” that maximize the reduction in queue size per unit time (i.e., the most selective operators per time unit)
- Greedily forward tuples into the most selective chain
- Within a chain, process the data in FIFO order

STREAM also does a form of join reordering
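The greedy step above can be sketched as scoring each chain by how much it shrinks its queue per unit of processing time and running the steepest one (the selectivity and cost numbers are made-up illustrative values, not from the paper):

```python
# Toy sketch of a Chain-style scheduling decision: pick the chain
# with the greatest queue reduction per unit time; within a chain,
# tuples would then be processed FIFO.

def pick_chain(chains):
    # chains: list of (name, queue_len, selectivity, cost_per_tuple)
    def steepness(c):
        name, qlen, sel, cost = c
        # fraction of tuples eliminated, per unit of processing time
        return (1 - sel) / cost if qlen > 0 else 0.0
    return max(chains, key=steepness)[0]

chains = [
    ("filter-chain", 100, 0.1, 1.0),   # drops 90% of tuples at cost 1
    ("join-chain",   100, 0.8, 2.0),   # drops 20% of tuples at cost 2
]
assert pick_chain(chains) == "filter-chain"
```

Running the most selective chain first keeps total queue memory low, which is the metric Chain scheduling optimizes.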
Scratching the Surface: Approximation

They point out two areas where we might need to approximate output:
- CPU is limited, and we need to drop some stream elements according to some probabilistic metric
  - Collect statistics via a profiler
  - Use the Hoeffding inequality to derive a sampling rate that maintains a confidence interval
  - This is generally termed load shedding
- May need to do similar things if memory usage is a constraint

Are there other options? When might they be useful?
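The Hoeffding step can be made concrete. For values scaled into [0, 1], the inequality gives P(|sample mean − true mean| ≥ ε) ≤ 2·exp(−2nε²), so n ≥ ln(2/δ) / (2ε²) samples keep the estimate within ε with probability 1 − δ; a load shedder can then keep only that many tuples per unit time (a sketch of the calculation, not STREAM’s code):

```python
import math

# Derive a sampling rate from the Hoeffding inequality:
#   P(|mean_est - mean| >= eps) <= 2 * exp(-2 * n * eps^2)
# Solving for n gives the minimum sample size below.

def min_sample_size(eps, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def sampling_rate(arrival_rate, eps, delta):
    # fraction of arriving tuples that must be kept (capped at 1)
    return min(1.0, min_sample_size(eps, delta) / arrival_rate)

n = min_sample_size(eps=0.05, delta=0.05)   # within ±0.05, 95% confidence
assert n == 738
# At 10,000 tuples/unit time, ~7.4% of tuples suffice:
rate = sampling_rate(10_000, eps=0.05, delta=0.05)
```

Note the bound is distribution-free, which is why a simple profiler’s statistics are enough to apply it.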
STREAM in General

“Logical semantics first”

Starts with a basic data model: streams as timestamped sets

Develops a language and semantics
- Heavily based on SQL

Proposes a relatively straightforward implementation
- Interesting ideas like k-constraints
- Interesting approaches like chain scheduling
- No real consideration of distributed processing
Aurora

“Implementation first; mix and match operations from past literature”

Basic philosophy: most of the ideas in streams existed in previous research
- Sliding windows, load shedding, approximation, …
- So let’s borrow those ideas and focus on how to build a real system with them!

Emphasis is on building a scalable, robust system
- Distributed implementation: Medusa
Queries in Aurora

Oddly: no declarative query language!

Queries are workflows of physical query operators (SQuAl)
- Many operators resemble relational algebra ops
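The “boxes and arrows” style can be sketched as operators wired directly into a pipeline rather than derived from a declarative statement (the filter/map boxes echo SQuAl’s flavor, but the plumbing and field names here are illustrative):

```python
# Sketch of an Aurora-style workflow query: the user composes
# physical operator boxes explicitly; there is no parser or optimizer
# deriving the plan from SQL.

def filter_box(pred):
    return lambda tuples: [t for t in tuples if pred(t)]

def map_box(fn):
    return lambda tuples: [fn(t) for t in tuples]

def workflow(*boxes):
    def run(tuples):
        for box in boxes:           # arrows: each box feeds the next
            tuples = box(tuples)
        return tuples
    return run

q = workflow(filter_box(lambda t: t["speed"] > 60),
             map_box(lambda t: {"car": t["car"], "fine": 50}))
out = q([{"car": "A", "speed": 70}, {"car": "B", "speed": 40}])
assert out == [{"car": "A", "fine": 50}]
```

The trade-off is clear in the sketch: the user gets precise control over the plan shape, but the system loses the freedom a declarative language gives an optimizer.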
Example Query
Some Interesting Aspects

A relatively simple adaptive query optimizer
- Can push filtering and mapping into many operators
- Can reorder some operators (e.g., joins, unions)

Need built-in error handling
- If a data source fails to respond within a certain amount of time, create a special alarm tuple
- This propagates through the query plan

Incorporates built-in load shedding and real-time scheduling to support QoS

Has a notion of combining a query over historical data with data from a stream
- Switches from a pull-based mode (reading from disk) to a push-based mode (reading from the network)
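The timeout-to-alarm behavior can be sketched as follows: when a source’s latest tuple is older than its deadline, a distinguished alarm tuple is injected so downstream operators can react (the deadlines and field names are illustrative assumptions, not Aurora’s actual format):

```python
# Sketch of Aurora-style error handling: a stale source yields a
# special alarm tuple that then flows through the query plan like
# ordinary data.

def read_with_timeout(source, deadline, now):
    ts, tup = source                     # (last arrival time, tuple)
    if now - ts > deadline:
        return {"alarm": True, "source": tup["source"]}
    return tup

fresh = read_with_timeout((100, {"source": "s1", "v": 7}), deadline=5, now=103)
stale = read_with_timeout((100, {"source": "s2", "v": 9}), deadline=5, now=110)
assert fresh == {"source": "s1", "v": 7}
assert stale == {"alarm": True, "source": "s2"}
```

Propagating the alarm in-band keeps failure handling inside the same dataflow machinery as normal processing.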
The Medusa Processor

Distributed coordinator between many Aurora nodes
- Scalability through federation and distribution
- Fail-over
- Load balancing
Main Components

Lookup
- Distributed catalog: schemas, where to find streams, where to find queries

Brain
- Query setup; load monitoring via I/O queues and stats
- A load distribution and balancing scheme is used
- Very reminiscent of Mariposa!
Load Balancing

Migration: an operator can be moved from one node to another
- The initial implementation didn’t support moving of state
- The state is simply dropped, and operator processing resumes
- Implications on semantics?
- Plans to support state migration

“Agoric system model to create incentives”
- Clients pay nodes for processing queries
- Nodes pay each other to handle load via pairwise contracts negotiated offline
- Bounded-price mechanism: a price for migration of load, plus a spec for what a node will take on
- Does this address the weaknesses of the Mariposa model?
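The bounded-price idea can be sketched as a migration decision: a node sheds an operator only when it is overloaded, and only to a partner whose offline-negotiated contract price stays under the agreed bound (the decision rule and numbers are illustrative assumptions, not Medusa’s actual protocol):

```python
# Sketch of a bounded-price migration decision between Medusa-style
# nodes with pairwise contracts negotiated offline.

def choose_partner(my_load, capacity, partners, price_bound):
    if my_load <= capacity:
        return None                       # not overloaded: keep the operator
    # only partners whose contracted price is under the bound qualify
    eligible = {n: p for n, p in partners.items() if p <= price_bound}
    # among those, offload to the cheapest
    return min(eligible, key=eligible.get) if eligible else None

partners = {"node-a": 12, "node-b": 7, "node-c": 9}
assert choose_partner(120, 100, partners, price_bound=10) == "node-b"
assert choose_partner(80, 100, partners, price_bound=10) is None
```

Because prices are fixed offline, the runtime decision is a cheap local lookup, avoiding Mariposa-style online auctions on the critical path.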
Some Applications They Tried

Financial services (stock ticker)
- The main issue is not volume, but problems with feeds
- Two-level alarm system, where the higher-level alarm helps diagnose problems
- Shared computation among queries
- User-defined aggregation and mapping

Linear Road (sensor monitoring)
- Traffic sensors on a toll road; the toll changes depending on how many cars are on the road
- Combination of historical and continuous queries

Environmental monitoring
- Sliding-window calculations
The Big Application?

Military battalion monitoring
- Positions & images of friends and foes
- Load shedding is important
- Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
- Based on a QoS utility function
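The contrast between the two shedding strategies can be sketched directly: random shedding drops tuples uniformly, while semantic shedding ranks tuples by a QoS utility function and drops the low-utility ones first (the utility function and data are illustrative assumptions):

```python
import random

# Sketch contrasting random vs. semantic (utility-driven) load
# shedding when only a fraction of tuples can be processed.

def semantic_shed(tuples, utility, keep_fraction):
    ranked = sorted(tuples, key=utility, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]                 # highest-utility tuples survive

def random_shed(tuples, keep_fraction, seed=0):
    rng = random.Random(seed)
    keep = max(1, int(len(tuples) * keep_fraction))
    return rng.sample(tuples, keep)      # an arbitrary subset survives

# Hypothetical battlefield readings with a "threat" score as utility:
readings = [{"id": i, "threat": i % 5} for i in range(10)]
kept = semantic_shed(readings, utility=lambda t: t["threat"], keep_fraction=0.4)
assert all(t["threat"] == 4 for t in kept[:2])   # top threats always kept
```

Under the same keep fraction, random shedding may discard exactly the high-threat tuples the application cares about, which is the argument for predicate-based dropping.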
Lessons Learned

Historical data is important, not just stream data (Summaries?)

Sometimes need synchronization for consistency (“ACID for streams”?)

Streams can be out of order, bursty (“Stream cleaning”?)

Adaptors and XML are important … but we already knew that!

Performance is critical
- They spent a great deal of time using microbenchmarks and optimizing
Borealis

Aurora is now commercial

Borealis follows up with some new directions:
- Dynamic revision of results, i.e., corrections to stream data
- Dynamic query modification: change on the fly
  - “Control lines”: change parameters
  - “Time travel”: support execution of multiple queries, starting from different points in time (past through future)
- Distributed optimization
- Combine stream and sensor processing ideas (we’ll talk about sensor nets next time)
  - Sensor-heavy vs. server-heavy optimization
Streams and Integration
How do streams and data integration relate?
Are streams the future, or just an interesting vista point on the side of the road?