Continuous Analytics Over Discontinuous Streams
Sailesh Krishnamurthy, Michael Franklin,
Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre
June 10, 2010, SIGMOD, Indianapolis
• Founded in 2005
• Roots in TelegraphCQ project from UC Berkeley
• HQ in Foster City, CA
• Focus on “Continuous Analytics”
• Fortune 100 and web-based Big Data Customers
Stream Query Processing (Traditional View)
[Figure labels: Source Data; Data Records / “Events”; CQ Processor; Real-Time Analysis; Update Display]
SQL Execution On Streaming Data
• A stream is an unbounded sequence of records
• A table is a set of records
• Window operators convert streams to tables
• SQL queries apply to tables
Window Operator
• Each window produces a set of records (a table)
• Semantics:
• Repeatedly apply generic SQL to the results of window operators
• Results are continuously appended to the output stream
Example: SQL Queries over Streams
SELECT I.Advertiser, SUM(I.price * I.volume)
FROM Impressions I <VISIBLE '5 sec' ADVANCE '3 sec'>, Campaigns C
WHERE I.campaign_id = C.campaign_id AND C.type = 'CPM'
GROUP BY I.Advertiser
“I want to look at 5 seconds worth of impressions”
“I want results every 3 seconds”
Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5 second “sliding window”
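[Figure: a Window sliding over the Impression Data Stream and producing Result(s) at each position; the <VISIBLE '5 sec' ADVANCE '3 sec'> part of the query is labeled “Window Operator Clause”]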
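The window semantics above can be illustrated with a minimal Python sketch. It is not the engine's implementation: the tuple layout, the sample data, and the choice that each window closes at a multiple of the ADVANCE interval and covers the preceding VISIBLE seconds are all assumptions made for illustration.

```python
from collections import defaultdict

def sliding_revenue(impressions, campaigns, visible=5, advance=3):
    """Revenue per advertiser for each sliding window.

    impressions: list of (ts, advertiser, price, volume, campaign_id)  [hypothetical layout]
    campaigns:   dict campaign_id -> type                              [hypothetical layout]
    Assumed window for close time c: all tuples with c - visible < ts <= c.
    """
    if not impressions:
        return []
    last_ts = max(ts for ts, *_ in impressions)
    results = []
    close = advance
    while close - advance < last_ts:          # stop once a window has covered the last tuple
        totals = defaultdict(float)
        for ts, advertiser, price, volume, campaign_id in impressions:
            in_window = close - visible < ts <= close
            if in_window and campaigns.get(campaign_id) == "CPM":
                totals[advertiser] += price * volume
        results.append((close, dict(totals)))  # one result set per window
        close += advance
    return results

# Made-up sample data.
impressions = [
    (1, "acme", 2.0, 10, "c1"),
    (2, "zeta", 1.0, 5, "c2"),   # joins a non-CPM campaign, so it is filtered out
    (5, "acme", 1.0, 10, "c1"),
    (7, "acme", 3.0, 1, "c1"),
]
campaigns = {"c1": "CPM", "c2": "CPC"}
windows = sliding_revenue(impressions, campaigns)
```

Because VISIBLE is larger than ADVANCE, consecutive windows overlap: the impression at time 5 contributes to both the window closing at 6 and the one closing at 9.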
Assumptions About Streams
Continuous sequences
Arriving mostly in order
[Figure: numbered tuples arriving mostly in order]
The Reality
[Figure: numbered tuples from several streams arriving far out of order, with gaps]
Minutes, hours, or days of late-arriving data
Multiple streams out of sync, with gaps, …
Traditional (in Order) Solution #1: “Slack”
Tuple # | Time Stamp | 3-Second Slack Buffer | OUTPUT
1       | 1          | 1                     |
2       | 2          | 1,2                   |
3       | 3          | 1,2,3                 |
4       | 2          | 1,2,2,3               |
5       | 6          | 6                     | 1,2,2,3
6       | 5          | 5,6                   |
7       | 1          | 5,6                   |
8       | 9          | 9                     | 5,6
9       | 8          | 8,9                   |
Slack
• Pros
  • Simple
  • Handles “jitter” (slightly out-of-order arrival)
• Cons
  • Introduces delay
  • Permanently drops arrivals later than the buffer allows
  • Unbounded buffer size
  • Permanently drops arrivals when there are lulls in multiple input streams
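A minimal Python sketch of a slack buffer, written to reproduce the 3-second example on the previous slide. The release rule (emit everything at or below the highest timestamp seen minus the slack, and drop anything that arrives at or below that point) is an assumption inferred from the example, and the function name is hypothetical.

```python
import heapq

def slack_reorder(timestamps, slack):
    """Pass timestamps through a slack buffer.

    Returns (emitted, dropped, still_buffered).
    """
    buffer = []                     # min-heap of buffered timestamps
    watermark = float("-inf")       # everything at or below this has been released
    max_seen = float("-inf")
    emitted, dropped = [], []
    for ts in timestamps:
        if ts <= watermark:         # too late: its slot was already released
            dropped.append(ts)
            continue
        heapq.heappush(buffer, ts)
        max_seen = max(max_seen, ts)
        watermark = max(watermark, max_seen - slack)
        while buffer and buffer[0] <= watermark:
            emitted.append(heapq.heappop(buffer))   # released in timestamp order
    return emitted, dropped, sorted(buffer)

# Arrival order from the slide's example.
emitted, dropped, buffered = slack_reorder([1, 2, 3, 2, 6, 5, 1, 9, 8], slack=3)
```

On the slide's input this releases 1,2,2,3 when timestamp 6 arrives and 5,6 when timestamp 9 arrives, drops the late tuple with timestamp 1, and leaves 8,9 buffered. The delay and the permanent drop listed under Cons are both visible in the trace.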
Traditional (in Order) Solution #2: “Drift”
Source 1 | Source 2 | 2-Second Drift Buffer ; OUTPUT (cells as extracted)
(A,1)    | (a,2)    | (A,1)
(B,2)    | (b,3)    | (a,2), (B,2)
(C,3)    | (c,4)    | (b,3), (C,3)
(G,4)    | (d,5)    | (c,4), (G,4)
(D,6)    |          | (d,5)
(E,7)    |          | (D,6), (E,7)
(R,8)    |          | (E,7), (R,8) ; (D,6)
(F,9)    | (x,5)    | (R,8), (F,9) ; (E,7)
         | (z,10)   | (z,10) ; (R,8), (F,9)
Drift
• Pros
  • Simple
  • Handles multiple streams with short “lulls” in arrival
• Cons
  • Doesn’t handle streams with dramatically different arrival rates
  • Permanently drops data that arrives after drift window has expired
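One plausible reading of drift, sketched in Python: merge the sources in timestamp order, waiting for the slowest source, but never letting any source hold the output back by more than the drift interval. The release rule and all names here are assumptions inferred from the 2-second example, not the slide's stated algorithm.

```python
import heapq

def drift_merge(arrivals, sources, drift):
    """Merge tuples from several sources with a bounded drift.

    arrivals: list of (source, label, ts) in arrival order  [hypothetical layout]
    Returns (emitted_labels, dropped_labels, still_buffered_labels).
    """
    latest = {s: None for s in sources}   # newest timestamp seen per source
    heap = []                             # (ts, arrival_seq, label)
    watermark = float("-inf")             # everything at or below this is released
    emitted, dropped = [], []
    for seq, (source, label, ts) in enumerate(arrivals):
        if ts <= watermark:               # arrived after its drift window expired
            dropped.append(label)
            continue
        latest[source] = ts if latest[source] is None else max(latest[source], ts)
        heapq.heappush(heap, (ts, seq, label))
        seen = [t for t in latest.values() if t is not None]
        # Safe point if every source has reported; otherwise wait on drift alone.
        slowest = min(seen) if len(seen) == len(latest) else float("-inf")
        watermark = max(watermark, slowest, max(seen) - drift)
        while heap and heap[0][0] <= watermark:
            emitted.append(heapq.heappop(heap)[2])
    return emitted, dropped, [label for _, _, label in sorted(heap)]

# Arrival order read off the slide's example, row by row.
arrivals = [
    ("s1", "A", 1), ("s2", "a", 2), ("s1", "B", 2), ("s2", "b", 3),
    ("s1", "C", 3), ("s2", "c", 4), ("s1", "G", 4), ("s2", "d", 5),
    ("s1", "D", 6), ("s1", "E", 7), ("s1", "R", 8), ("s1", "F", 9),
    ("s2", "x", 5), ("s2", "z", 10),
]
emitted, dropped, buffered = drift_merge(arrivals, ["s1", "s2"], drift=2)
```

Under this reading, Source 2 goes quiet after (d,5), so (D,6) is held until (R,8) pushes the drift bound to 6. When (x,5) finally arrives, its window has expired and it is dropped.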
Traditional Solution #3: Order-agnostic Operators
• Slack and Drift aim to order streams before presenting them to order-sensitive operators
• Many operators don’t care about order
SELECT count(*), cq_close(*) ts
FROM S <slices '5 seconds'>
Out of Order Processing: Count Example
Tuple # | Time Stamp | Heart-Beat | Count State | OUTPUT
1       | 1          |            | 1           |
2       | 3          |            | 2           |
3       | 2          |            | 3           |
4       | 4          |            | 4           |
5       |            | 5          |             | (4,t=5)
6       | 6          |            | 1           |
7       | 2          |            | 1           |
8       | 9          |            | 2           |
9       | 7          |            | 3           |
10      | 3          |            | 3           |
11      |            | 10         |             | (3,t=10)
Order-agnostic Operators
• Pros
  • No buffering
  • No extra delays
  • Handles out-of-order tuples that make it before heart-beat
• Cons
  • Some operators do care about order
  • Permanently drops data that arrives after heartbeat
  • Note: Lost data also impacts bigger “roll up queries”, e.g. <slices 15 seconds> with sharing
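A minimal Python sketch of an order-agnostic count driven by heart-beats, matching the count example above. The event encoding ("data" / "hb" pairs) and the function name are hypothetical; the rule that a tuple at or before the last heart-beat is discarded is taken from the example.

```python
def count_with_heartbeats(events):
    """Count tuples per slice without sorting them.

    events: list of ("data", ts) or ("hb", t) in arrival order  [hypothetical encoding]
    Returns (outputs, dropped) where outputs holds (count, t) pairs.
    """
    count = 0
    last_hb = float("-inf")
    outputs, dropped = [], []
    for kind, t in events:
        if kind == "hb":
            outputs.append((count, t))   # close the slice ending at t
            count = 0
            last_hb = t
        elif t <= last_hb:
            dropped.append(t)            # slice already closed: data is lost
        else:
            count += 1                   # arrival order inside the slice is irrelevant
    return outputs, dropped

# Arrival order from the slide's example.
events = [("data", 1), ("data", 3), ("data", 2), ("data", 4), ("hb", 5),
          ("data", 6), ("data", 2), ("data", 9), ("data", 7), ("data", 3), ("hb", 10)]
outputs, dropped = count_with_heartbeats(events)
```

No buffer is needed and nothing is delayed, but the tuples with timestamps 2 and 3 that arrive after the heart-beat at 5 never reach any count.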
So, how to handle very late data and discontinuous streams?
“Stream-Relational” Architecture [CIDR 09]
[Figure labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store; SQL Interface; Raw Data; Aggregates; JDBC / JMS; XML; Flat files; ETL tools; SOAP APIs; Data Warehouse; App Logic / UDFs; Other TruCQ’s]
Order-Independent Processing: Overview
• Answers that have already been delivered can only be compensated
• Need to preserve all arriving data
• Queries return answers based on all relevant data that has arrived:
  • CQ’s: Continuous Queries
  • SQ’s: SQL queries on archived streams & answers
• Approach: Leverage benefits of SQL(!):
  • Data-parallel processing w/ on-demand consolidation
  • Powerful “View” mechanisms
• Basically, create parallel partitions for late data
• Rewrite queries as views over partial results
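The idea of parallel partitions for late data can be sketched in Python as follows. This is a simplified illustration, not the engine's mechanism: it keeps one late partition per window and releases them only on an explicit flush, so its flush schedule is an assumption and will not match the engine's. The event encoding and names are hypothetical.

```python
def window_end(ts, width):
    """End of the slice containing ts, e.g. ts 1..5 -> 5 for width 5."""
    return -(-ts // width) * width

def count_with_partitions(events, width=5):
    """Emit partial counts (count, window_end) without dropping late data.

    events: ("data", ts), ("hb", t) or ("flush", None) in arrival order  [hypothetical]
    """
    open_counts = {}      # in-order partition: windows not yet closed
    late_counts = {}      # separate partitions for data behind the heart-beat
    last_hb = float("-inf")
    psrs = []
    for kind, value in events:
        if kind == "data":
            target = late_counts if value <= last_hb else open_counts
            w = window_end(value, width)
            target[w] = target.get(w, 0) + 1
        elif kind == "hb":
            for w in sorted(w for w in open_counts if w <= value):
                psrs.append((open_counts.pop(w), w))    # normal, timely answer
            last_hb = value
        elif kind == "flush":
            for w in sorted(late_counts):
                psrs.append((late_counts.pop(w), w))    # partial answer for late data
    return psrs

events = ([("data", t) for t in (1, 3, 2, 4, 2, 1)] + [("hb", 5)]
          + [("data", t) for t in (6, 2, 9, 7, 3)] + [("hb", 10)]
          + [("data", t) for t in (12, 8, 4, 3, 9)] + [("hb", 15), ("flush", None)])
psrs = count_with_partitions(events)
```

Each window can now appear in the output more than once. The late counts are emitted as additional partial results rather than discarded, and a query over the output has to add them up per window to get the full answer.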
Out of Order Processing: Count Example
Tuple # | Data TS | Control | Count State Partitions | OUTPUT
1       | 1       |         | 1                      |
2       | 3       |         | 2                      |
3       | 2       |         | 3                      |
4       | 4       |         | 4                      |
5       | 2       |         | 5                      |
6       | 1       |         | 6                      |
7       |         | 5       |                        | (6,t=5)
8       | 6       |         | 1                      |
9       | 2       |         | 1 1                    |
10      | 9       |         | 2 1                    |
11      | 7       |         | 3 1                    |
Out of Order Processing: Count Example
Tuple # | Data TS | Control | Count State Partitions | OUTPUT
        |         |         |                        | (6,t=5)
11      | 7       |         | 3 1                    |
12      | 3       |         | 3 2                    |
13      |         | 10      | 2                      | (3,t=10)
14      | 12      |         | 1 2                    |
15      | 8       |         | 1 1                    | (2,t=5)
16      | 4       |         | 1 1                    |
17      | 3       |         | 1 2                    |
18      | 9       |         | 2 2                    |
19      |         | 15      | 2 2                    | (1,t=15)
20      |         | flush-2 | 2                      | (2,t=10)
21      |         | flush-3 |                        | (2,t=5)
Out of Order Processing: Count Example
OUTPUT: (6,t=5), (3,t=10), (2,t=5), (1,t=15), (2,t=10), (2,t=5)
• Treat output as “Partial State Records” (PSRs)
• Rewrite queries using views over PSRs
  • i.e., consolidate On-Demand
  • Paper goes into substantial detail on how rewrites work
• <Slices 5 second>
  • Same answer as Order-Insensitive
• <Slices 15 second> as roll-up
  • Answer contains all data
• Subsequent SQs over archived results and raw data contain all data too!
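On-demand consolidation over the six partial state records listed above amounts to summing the counts that fall in the same window. The Python sketch below shows the arithmetic only; the function name is hypothetical, and in the system described this would be expressed as a SQL view over the PSRs rather than application code.

```python
from collections import defaultdict

def consolidate(psrs, width):
    """Sum partial counts (count, t) into windows of the given width."""
    totals = defaultdict(int)
    for count, t in psrs:
        end = -(-t // width) * width      # end of the window containing t
        totals[end] += count
    return dict(sorted(totals.items()))

# The OUTPUT column from the count example.
psrs = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]

five_second = consolidate(psrs, 5)      # per-slice counts
fifteen_second = consolidate(psrs, 15)  # roll-up over the same PSRs
```

The 5-second view yields 10, 5, and 1 for the windows ending at 5, 10, and 15, and the 15-second roll-up yields 16, which equals the number of data tuples in the example. No late tuple is missing from either answer.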
Handles Very Late Data, Plus You Get…
• Parallel Processing – Multicore and Cluster
[Figure: Client machines and processing nodes (U, D) joined by a High-bandwidth Network Interconnect]
D = Distributed Processing Node
U = Unified Processing Node
Other Details in the Paper
• Beyond late data and parallelism, the approach is also key to supporting:
  • Fault Tolerance using replication
  • High-Availability via fast restart
  • “Nostalgic” continuous queries that start in the past and catch up to the present
  • Fast concurrent creation of archives for new CQs
• Algorithmic/Systems details on:
  • Integration with overall system architecture
  • Interaction with Transaction Mechanism
  • Need for Background Reducer task
  • Hybrid Plans for non-parallelizable parts of queries
Conclusions
• Early Stream Processing Systems were based on simplistic assumptions about ordering
• Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped
• Approach leverages strengths of SQL
  • Data-parallel processing models
  • Sophisticated and efficient view functionality
• Key is On-Demand Consolidation
  • Of course, you can only do it if you have an integrated stream-relational system
For more info: [email protected] or [email protected]