Continuous Analytics Over Discontinuous Streams

Sailesh Krishnamurthy, Michael Franklin,

Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre

June 10, 2010, SIGMOD, Indianapolis

• Founded in 2005

• Roots in TelegraphCQ project from UC Berkeley

• HQ in Foster City, CA

• Focus on “Continuous Analytics”

• Fortune 100 and web-based Big Data Customers

Stream Query Processing (Traditional View)

[Figure labels: Source Data; Data Records / “Events”; CQ Processor; Real-Time Analysis; Update Display]

SQL Execution On Streaming Data

• A stream is an unbounded sequence of records

• A table is a set of records

• Window operators convert streams to tables

• SQL queries apply to tables

Window Operator

• Each window produces a set of records (a table)
• Semantics:
  • Repeatedly apply generic SQL to the results of window operators
  • Results are continuously appended to the output stream
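
The semantics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine's implementation: the names `tumbling_windows` and `continuous_query`, the `ts` field, and the choice of non-overlapping windows of the form [k*width, (k+1)*width) are all hypothetical. Each window yields a finite table, the same table-level query runs on every table, and the results form the output stream.

```python
from typing import Callable, Iterable, Iterator

Record = dict   # one stream record, e.g. {"ts": 3, "price": 2.0}
Table = list    # the finite set of records produced by one window


def tumbling_windows(stream: Iterable[Record], width: int) -> Iterator[Table]:
    """Cut an in-order stream into consecutive tables of `width` time units."""
    table: Table = []
    window_end = width
    for rec in stream:
        # Emit every window that ends at or before this record's timestamp.
        while rec["ts"] >= window_end:
            yield table
            table = []
            window_end += width
        table.append(rec)
    if table:
        yield table  # flush the last, partially filled window


def continuous_query(stream, width, query: Callable[[Table], object]):
    """Apply the same table-level query to each window; yield a result stream."""
    for table in tumbling_windows(stream, width):
        yield query(table)


events = [{"ts": t, "price": p}
          for t, p in [(0, 1.0), (1, 2.0), (5, 4.0), (6, 1.5), (11, 3.0)]]
results = list(continuous_query(events, 5,
                                lambda tbl: sum(r["price"] for r in tbl)))
print(results)
```

The query here is a plain function over a list; in the system described by the slides it would be generic SQL applied to the window's table.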

Example: SQL Queries over Streams

SELECT I.Advertiser, SUM(I.price * I.volume)
FROM Impressions I <VISIBLE '5 sec' ADVANCE '3 sec'>, Campaigns C
WHERE I.campaign_id = C.campaign_id AND C.type = 'CPM'
GROUP BY I.Advertiser

Window operator clause:
• VISIBLE '5 sec': “I want to look at 5 seconds’ worth of impressions”
• ADVANCE '3 sec': “I want results every 3 seconds”

Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5-second “sliding window”.

[Figure: a window sliding over the Impression Data Stream, producing Result(s) at each step]
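
A minimal Python sketch of what the example query computes, assuming integer timestamps in seconds and windows of the form [close - 5, close) that close every 3 seconds. The function name `sliding_revenue`, the record field names, and the sample data are hypothetical, and the exact boundary treatment of the real window clause may differ.

```python
from collections import defaultdict


def sliding_revenue(impressions, campaigns, visible=5, advance=3):
    """Revenue per advertiser over [close - visible, close), every `advance` seconds."""
    if not impressions:
        return []
    # Join condition plus filter: keep only campaigns of type CPM.
    cpm_ids = {c["campaign_id"] for c in campaigns if c["type"] == "CPM"}
    last_ts = max(i["ts"] for i in impressions)
    results = []
    close = advance
    while True:
        revenue = defaultdict(float)
        for imp in impressions:
            in_window = close - visible <= imp["ts"] < close
            if in_window and imp["campaign_id"] in cpm_ids:
                revenue[imp["advertiser"]] += imp["price"] * imp["volume"]
        results.append((close, dict(revenue)))
        if close > last_ts:  # every impression has been covered by a window
            break
        close += advance
    return results


campaigns = [{"campaign_id": 1, "type": "CPM"},
             {"campaign_id": 2, "type": "CPC"},
             {"campaign_id": 3, "type": "CPM"}]
impressions = [
    {"ts": 0, "advertiser": "A", "price": 1.0, "volume": 2, "campaign_id": 1},
    {"ts": 1, "advertiser": "B", "price": 2.0, "volume": 1, "campaign_id": 2},
    {"ts": 2, "advertiser": "A", "price": 0.5, "volume": 4, "campaign_id": 1},
    {"ts": 4, "advertiser": "B", "price": 1.0, "volume": 3, "campaign_id": 3},
    {"ts": 7, "advertiser": "A", "price": 2.0, "volume": 1, "campaign_id": 1},
]
results = sliding_revenue(impressions, campaigns)
for close, revenue in results:
    print(close, revenue)
```

Because the window is wider than the advance, the impression at ts 2 contributes to the windows closing at 3 and at 6, and the one at ts 4 to the windows closing at 6 and at 9.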

Assumptions About Streams

Continuous sequences

Arriving mostly in order

[Figure: numbered tuples arriving as a continuous sequence, mostly in order]

The Reality

[Figure: numbered tuples from several streams arriving out of order, with gaps (e.g. 1, 5, ?)]

Data arriving minutes, hours, or days late

Multiple streams out of sync, with gaps, …

Traditional (in Order) Solution #1: “Slack”

Tuple #   Time Stamp   3-Second Slack Buffer   OUTPUT
1         1            1
2         2            1,2
3         3            1,2,3
4         2            1,2,2,3
5         6            6                       1,2,2,3
6         5            5,6
7         1            5,6
8         9            9                       5,6
9         8            8,9
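
The slack example above can be replayed with a small Python sketch. It assumes one specific release rule that the slide does not spell out: a tuple is released once the highest timestamp seen exceeds it by at least the slack, and an arrival older than that release point is dropped. The class name `SlackBuffer` and its fields are hypothetical.

```python
import heapq


class SlackBuffer:
    """Reorders a slightly out-of-order stream using a fixed slack (assumed rule)."""

    def __init__(self, slack):
        self.slack = slack
        self.heap = []                 # buffered timestamps, smallest first
        self.max_ts = float("-inf")    # highest timestamp seen so far
        self.dropped = []              # arrivals that came too late

    def push(self, ts):
        """Accept one timestamp; return the timestamps released, in order."""
        if ts < self.max_ts - self.slack:
            self.dropped.append(ts)    # older than the release point: lost for good
            return []
        heapq.heappush(self.heap, ts)
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - self.slack
        released = []
        while self.heap and self.heap[0] <= watermark:
            released.append(heapq.heappop(self.heap))
        return released


buf = SlackBuffer(slack=3)
arrivals = [1, 2, 3, 2, 6, 5, 1, 9, 8]      # the Time Stamp column of the example
released = [buf.push(ts) for ts in arrivals]
print(released)        # releases 1,2,2,3 when 6 arrives and 5,6 when 9 arrives
print(buf.dropped)     # [1], the tuple with time stamp 1 that arrived seventh
```

The buffer grows with the number of tuples that fall inside the slack interval, which is where the delay and buffer-size drawbacks come from.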

Slack

• Pros
  • Simple
  • Handles “jitter” (slightly out-of-order arrival)
• Cons
  • Introduces delay
  • Permanently drops arrivals later than the buffer
  • Unbounded buffer size
  • Permanently drops arrivals if there are lulls in multiple input streams

Traditional (in Order) Solution #2: “Drift”

Source 1   Source 2   2-Second Drift Buffer   OUTPUT
(A,1)      (a,2)                              (A,1)
(B,2)      (b,3)                              (a,2), (B,2)
(C,3)      (c,4)                              (b,3), (C,3)
(G,4)      (d,5)                              (c,4), (G,4)
(D,6)                                         (d,5)
(E,7)                 (D,6), (E,7)
(R,8)                 (E,7), (R,8)            (D,6)
(F,9)      (x,5)      (R,8), (F,9)            (E,7)
           (z,10)     (z,10)                  (R,8), (F,9)
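
One way to read the drift example is as a merge with a bounded wait. The Python sketch below is an interpretation, not the documented algorithm: it releases a tuple once every source has reached its timestamp, or once the fastest source has run ahead of it by the drift bound, whichever comes first. That rule was chosen because it is consistent with the example's releases of (D,6), then (E,7), then (R,8) and (F,9), and with (x,5) being lost. The class `DriftMerger` and its fields are hypothetical.

```python
import heapq
from itertools import count


class DriftMerger:
    """Merges several streams in timestamp order with a bounded drift (assumed rule)."""

    def __init__(self, n_sources, drift):
        self.drift = drift
        self.latest = [float("-inf")] * n_sources   # newest timestamp per source
        self.watermark = float("-inf")               # release point reached so far
        self.heap = []                               # (ts, arrival order, payload)
        self.order = count()
        self.dropped = []

    def push(self, arrivals):
        """arrivals: list of (source index, payload, ts) arriving together."""
        for src, payload, ts in arrivals:
            if ts < self.watermark:
                self.dropped.append((payload, ts))   # drift window already expired
                continue
            heapq.heappush(self.heap, (ts, next(self.order), payload))
            self.latest[src] = max(self.latest[src], ts)
        # Safe point: all sources have reached it, or the fastest source is
        # ahead of it by the full drift bound.
        self.watermark = max(self.watermark,
                             min(self.latest),
                             max(self.latest) - self.drift)
        out = []
        while self.heap and self.heap[0][0] <= self.watermark:
            ts, _, payload = heapq.heappop(self.heap)
            out.append((payload, ts))
        return out


steps = [
    [(0, "A", 1), (1, "a", 2)],
    [(0, "B", 2), (1, "b", 3)],
    [(0, "C", 3), (1, "c", 4)],
    [(0, "G", 4), (1, "d", 5)],
    [(0, "D", 6)],
    [(0, "E", 7)],
    [(0, "R", 8)],
    [(0, "F", 9), (1, "x", 5)],
    [(1, "z", 10)],
]
merger = DriftMerger(n_sources=2, drift=2)
outputs = [merger.push(step) for step in steps]
for out in outputs:
    print(out)
print("dropped:", merger.dropped)
```

While Source 2 is silent, Source 1 tuples wait at most 2 seconds of stream time before release; when Source 2 resumes with (x,5), the release point has already moved past 5, so that tuple is dropped.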

Drift

• Pros
  • Simple
  • Handles multiple streams with short “lulls” in arrival
• Cons
  • Doesn’t handle streams with dramatically different arrival rates
  • Permanently drops data that arrives after the drift window has expired

Traditional Solution #3: Order-agnostic Operators

• Slack and Drift aim to order streams before presenting them to order-sensitive operators

• Many operators don’t care about order

SELECT count(*), cq_close(*) ts
FROM S <slices '5 seconds'>

Out of Order Processing: Count Example

Tuple #   Time Stamp   Heart-Beat   Count State   OUTPUT
1         1                         1
2         3                         2
3         2                         3
4         4                         4
5                      5                          (4,t=5)
6         6                         1
7         2                         1
8         9                         2
9         7                         3
10        3                         3
11                     10                         (3,t=10)
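
The count example above can be replayed with a short Python sketch. It assumes that a heart-beat at time t closes the slice ending at t, and that tuples never run ahead of the next heart-beat, so every on-time tuple belongs to the currently open slice. The class `SliceCounter` and the event encoding are hypothetical.

```python
class SliceCounter:
    """Order-agnostic count per slice; tuples behind the last heart-beat are lost."""

    def __init__(self):
        self.count = 0           # count state of the currently open slice
        self.closed_up_to = 0    # time of the last heart-beat
        self.dropped = 0

    def on_tuple(self, ts):
        if ts <= self.closed_up_to:
            self.dropped += 1    # its slice was already reported
        else:
            self.count += 1      # order inside the open slice does not matter

    def on_heartbeat(self, t):
        result = (self.count, t)
        self.count = 0
        self.closed_up_to = t
        return result


# ("data", time stamp) or ("hb", heart-beat time), in arrival order.
events = [("data", 1), ("data", 3), ("data", 2), ("data", 4), ("hb", 5),
          ("data", 6), ("data", 2), ("data", 9), ("data", 7), ("data", 3),
          ("hb", 10)]
counter = SliceCounter()
outputs = []
for kind, t in events:
    if kind == "data":
        counter.on_tuple(t)
    else:
        outputs.append(counter.on_heartbeat(t))
print(outputs)           # [(4, 5), (3, 10)]
print(counter.dropped)   # 2: the tuples stamped 2 and 3 that arrived after t=5
```

No buffering is needed because a count does not depend on arrival order within a slice; the two late tuples are simply never counted.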

Order-agnostic Operators

• Pros
  • No buffering
  • No extra delays
  • Handles out-of-order tuples that make it before the heart-beat
• Cons
  • Some operators do care about order
  • Permanently drops data that arrives after the heart-beat
  • Note: lost data also impacts bigger “roll-up queries”, e.g. <slices 15 seconds> with sharing

So, how to handle very late data and discontinuous streams?

“Stream-Relational” Architecture [CIDR 09]

[Figure labels: Integration Framework; JDBC / JMS, XML, Flat files, ETL tools, SOAP APIs; Shared Stream Query Processor; Persistent Data Store; Raw Data, Aggregates; SQL Interface; Data Warehouse; App Logic / UDFs; Other TruCQs]

Order-Independent Processing: Overview

• Answers that have already been delivered can only be compensated
• Need to preserve all arriving data
• Queries return answers based on all relevant data that has arrived:
  • CQs: Continuous Queries
  • SQs: SQL queries on archived streams & answers
• Approach: leverage benefits of SQL(!):
  • Data-parallel processing w/ on-demand consolidation
  • Powerful “View” mechanisms
• Basically, create parallel partitions for late data
• Rewrite queries as views over partial results


Tuple #   Data TS   Control   Count State Partitions   OUTPUT
1         1                   1
2         3                   2
3         2                   3
4         4                   4
5         2                   5
6         1                   6
7                   5                                  (6,t=5)
8         6                   1
9         2                   1  1
10        9                   2  1
11        7                   3  1


Tuple #   Data TS   Control   Count State Partitions   OUTPUT
                                                       (6,t=5)
11        7                   3  1
12        3                   3  2
13                  10        2                        (3,t=10)
14        12                  1  2
15        8                   1  1                     (2,t=5)
16        4                   1  1
17        3                   1  2
18        9                   2  2
19                  15        2  2                     (1,t=15)
20                  flush-2   2                        (2,t=10)
21                  flush-3                            (2,t=5)
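
The partitioned count above can be approximated with the Python sketch below. It is a simplification under stated assumptions: slices are (close - 5, close], on-time tuples go to an in-order partition, and late tuples go to a separate partition keyed by their slice, which is emitted as its own partial record on flush. The sketch keeps a single late partition per slice and flushes only at the end, so it emits fewer partial records than the example and does not reproduce its exact flush schedule. The class `PartitionedCounter` is hypothetical.

```python
import math
from collections import defaultdict


class PartitionedCounter:
    """Counts per slice without dropping late tuples (simplified sketch)."""

    def __init__(self, width):
        self.width = width
        self.open = defaultdict(int)   # in-order partition: slice close -> count
        self.late = defaultdict(int)   # late partition: slice close -> count
        self.closed_up_to = 0
        self.output = []               # partial state records (count, slice close)

    def _close_time(self, ts):
        return math.ceil(ts / self.width) * self.width

    def on_tuple(self, ts):
        target = self.late if ts <= self.closed_up_to else self.open
        target[self._close_time(ts)] += 1

    def on_heartbeat(self, t):
        for close in sorted(c for c in self.open if c <= t):
            self.output.append((self.open.pop(close), close))
        self.closed_up_to = max(self.closed_up_to, t)

    def flush_late(self):
        for close in sorted(self.late):
            self.output.append((self.late.pop(close), close))


# Data TS and Control columns of the example, in arrival order.
events = [("data", 1), ("data", 3), ("data", 2), ("data", 4), ("data", 2),
          ("data", 1), ("hb", 5), ("data", 6), ("data", 2), ("data", 9),
          ("data", 7), ("data", 3), ("hb", 10), ("data", 12), ("data", 8),
          ("data", 4), ("data", 3), ("data", 9), ("hb", 15)]
pc = PartitionedCounter(width=5)
for kind, t in events:
    if kind == "data":
        pc.on_tuple(t)
    else:
        pc.on_heartbeat(t)
pc.flush_late()
print(pc.output)
```

Every arriving tuple ends up in some partial record, so the per-slice totals (10 for t=5, 5 for t=10, 1 for t=15) agree with the example even though the records are split differently.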


OUTPUT: (6,t=5), (3,t=10), (2,t=5), (1,t=15), (2,t=10), (2,t=5)

• Treat output as “Partial State Records” (PSRs)
• Rewrite queries using views over PSRs
  • i.e., consolidate on demand
  • Paper goes into substantial detail on how rewrites work
• <Slices 5 second>
  • Same answer as Order-Insensitive
• <Slices 15 second> as roll-up
  • Answer contains all data
• Subsequent SQs over archived results and raw data contain all data too!
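
On-demand consolidation of the partial state records listed above amounts to a grouped sum. The Python sketch below stands in for the view-based rewrite described in the paper; it is not that rewrite itself. The function `consolidate` and the `PSRS` constant are hypothetical names, and the records are taken from the example output.

```python
import math
from collections import defaultdict

# Partial state records (count, slice close time) from the example output.
PSRS = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]


def consolidate(psrs, width):
    """Sum partial counts per slice of `width` seconds, like a GROUP BY view."""
    totals = defaultdict(int)
    for n, t in psrs:
        close = math.ceil(t / width) * width   # slice that contains time t
        totals[close] += n
    return dict(sorted(totals.items()))


per_slice = consolidate(PSRS, 5)    # 5-second slices
roll_up = consolidate(PSRS, 15)     # 15-second roll-up over the same records
print(per_slice)   # {5: 10, 10: 5, 15: 1}
print(roll_up)     # {15: 16}
```

Because a count is additive, partial records for the same slice can be summed in any order, and the 15-second roll-up computed from the same records includes the late data as well.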

Handles Very Late Data, Plus You Get…

• Parallel Processing – Multicore and Cluster

[Figure: Clients, a High-bandwidth Network Interconnect, and processing nodes labeled U and D]

D = Distributed Processing Node
U = Unified Processing Node

Other Details in the Paper

• Beyond late data and parallelism, the approach is also key to supporting:
  • Fault tolerance using replication
  • High availability via fast restart
  • “Nostalgic” continuous queries that start in the past and catch up to the present
  • Fast concurrent creation of archives for new CQs
• Algorithmic/systems details on
  • Integration with overall system architecture
  • Interaction with the transaction mechanism
  • Need for a background Reducer task
  • Hybrid plans for non-parallelizable parts of queries

Conclusions

• Early stream processing systems were based on simplistic assumptions about ordering
• Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped
• Approach leverages strengths of SQL
  • Data-parallel processing models
  • Sophisticated and efficient view functionality
• Key is on-demand consolidation
  • Of course, you can only do it if you have an integrated stream-relational system

For more info: [email protected] or [email protected]
