STREAM The Stanford Data Stream Management System

Upload: bartholomew-williams

Post on 26-Dec-2015


TRANSCRIPT

Page 1: STREAM The Stanford Data Stream Management System

STREAM

The Stanford Data Stream Management System

Page 2: STREAM The Stanford Data Stream Management System

Presentation Structure

• Introduction

• CQL: Continuous Query Language

– Abstract Semantics

– Data Types

– Operators

• Query Plan & Execution

Page 3: STREAM The Stanford Data Stream Management System

Introduction

• The system is designed for limited resource environments where streams may be rapid, and query loads may vary over time.

Page 4: STREAM The Stanford Data Stream Management System

CQL: Continuous Query Language

• For simple continuous queries over streams, it can be sufficient to use a relational query language such as SQL.

• However, for more complex queries, the precise meaning of such a query over streams can quickly become unclear.

Page 5: STREAM The Stanford Data Stream Management System

Abstract Semantics

• 2 Data Types:

– Streams

– Relations

• Both are defined on a discrete, ordered time domain

• 3 Types of Operators

Page 6: STREAM The Stanford Data Stream Management System

Streams

A stream S is an unbounded bag (multiset) of pairs <s,t>, where s is a tuple and t is the timestamp that denotes the logical arrival time of tuple s on stream S.

Page 7: STREAM The Stanford Data Stream Management System

Relations

A relation R is a time-varying bag of tuples. The bag of tuples at time t is denoted R(t), and we call R(t) an instantaneous relation.

Note that tuples in R(t) have no time-stamp.
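The two data types can be sketched in Python as follows. This is only an illustration, not STREAM's internal representation: the sample stream `S` and the helper `all_tuples_up_to` are hypothetical names, a stream is modelled as a list of (tuple, timestamp) pairs, and an instantaneous relation R(t) is modelled as a `Counter` used as a bag.

```python
from collections import Counter

# Hypothetical sample stream: (tuple, timestamp) pairs; duplicates are allowed.
S = [(("a",), 1), (("b",), 2), (("a",), 2), (("c",), 4)]

def all_tuples_up_to(stream, t):
    """One example relation R: R(t) holds every tuple that arrived by time t."""
    # Timestamps are dropped here: tuples in R(t) carry no timestamp.
    return Counter(s for s, ts in stream if ts <= t)

# The relation is time-varying: each t gives a different instantaneous bag.
for t in (1, 2, 4):
    print(t, dict(all_tuples_up_to(S, t)))
```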

Page 8: STREAM The Stanford Data Stream Management System

Operator Diagram

Page 9: STREAM The Stanford Data Stream Management System

Operator Classes

• A relation-to-relation operator takes one or more relations as input and produces a relation as output.

• A stream-to-relation operator takes a stream as input and produces a relation as output.

• A relation-to-stream operator takes a relation as input and produces a stream as output.

• Stream-to-stream operators are absent; they are composed from operators of the above classes.

Page 10: STREAM The Stanford Data Stream Management System

Query Structure

• A continuous query Q is a tree of operators belonging to the aforementioned classes.

• The inputs of Q are the streams and relations that are input to the leaf operators.

• The output of Q is the output of the root operator.

• The output is either a stream or a relation, depending on the class of the root operator.

Page 11: STREAM The Stanford Data Stream Management System

Output Timestamp

• At time t, an operator of Q logically depends on its inputs up to t.

• The operator therefore produces new outputs corresponding to t:

– tuples of S with timestamp t, if the output is a stream S, or

– the instantaneous relation R(t), if the output is a relation R

Page 12: STREAM The Stanford Data Stream Management System

Relation-to-Relation Operators

• CQL uses SQL constructs to express its relation-to-relation operators

• i.e. SELECT ... FROM …

Page 13: STREAM The Stanford Data Stream Management System

Class Operator Diagram

Name                 Operator Type         Description
select               relation-to-relation  Filters elements based on predicate(s)
project              relation-to-relation  Duplicate-Preserving Projection
binary-join          relation-to-relation  Joins two input relations
mjoin                relation-to-relation  Multiway join from [22]
union                relation-to-relation  Bag Union
except               relation-to-relation  Bag Difference
intersect            relation-to-relation  Bag Intersection
antisemijoin         relation-to-relation  Antisemijoin of two input relations
aggregate            relation-to-relation  Performs grouping and aggregation
duplicate-eliminate  relation-to-relation  Performs duplicate elimination

Page 14: STREAM The Stanford Data Stream Management System

Stream-to-Relation Operators

• Based on a sliding window principle.

• 3 Types of Windows:

– Tuple-based window

– Time-based window

– Partitioned window

Page 15: STREAM The Stanford Data Stream Management System

Tuple-based Window

• A tuple-based sliding window on a stream S takes an integer N > 0 as a parameter and produces a relation R. At time t, R(t) contains the N tuples of S with the largest timestamps ≤ t (or all tuples seen so far, if fewer than N have arrived).

• Example: R(14) [Rows 5]
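A minimal Python sketch of the tuple-based window. The function name `rows_window` is hypothetical. The sketch assumes the timestamp bound is inclusive (timestamps up to and including t), and that ties between equal timestamps are broken by arrival order.

```python
from collections import Counter

def rows_window(stream, n, t):
    """Return R(t): the n tuples of `stream` with the largest timestamps <= t.

    `stream` is a list of (tuple, timestamp) pairs.
    """
    seen = [(s, ts) for s, ts in stream if ts <= t]
    seen.sort(key=lambda pair: pair[1])  # stable sort: ties keep arrival order
    return Counter(s for s, _ in seen[-n:])  # bag of the n most recent tuples

# Hypothetical sample data.
stream = [(("a",), 1), (("b",), 2), (("c",), 3), (("d",), 5)]
print(rows_window(stream, 2, 4))
```

If fewer than n tuples have arrived by time t, the slice simply returns all of them.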

Page 16: STREAM The Stanford Data Stream Management System

Time-based window

• A time-based sliding window on a stream S takes a time interval w as a parameter and produces a relation R. At time t, R(t) contains all tuples of S with timestamps between t-w and t.

• Example: R(9) [Range 4]
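The time-based window can be sketched the same way. The function name `range_window` is hypothetical, and both ends of the interval [t − w, t] are assumed to be inclusive.

```python
from collections import Counter

def range_window(stream, w, t):
    """Return R(t): all tuples of `stream` with timestamps in [t - w, t].

    `stream` is a list of (tuple, timestamp) pairs.
    """
    return Counter(s for s, ts in stream if t - w <= ts <= t)

# Hypothetical sample data, evaluated as R(9) with a range of 4.
stream = [(("a",), 3), (("b",), 5), (("c",), 8), (("d",), 10)]
print(range_window(stream, 4, 9))
```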

Page 17: STREAM The Stanford Data Stream Management System

Partitioned Window

• A partitioned sliding window on a stream S takes an integer N and a set of attributes {A1, ..., Ak} of S as parameters, and is specified by following S with [Partition By A1,...,Ak Rows N]. It logically partitions S into different substreams, one per combination of values of A1,...,Ak.

• Hint: Rows N is then applied as a tuple-based window on each substream.
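A sketch of the partitioned window in Python. The name `partitioned_window` and the `key` callback (which extracts the partitioning attributes from a tuple) are hypothetical. The sketch assumes substreams are formed by equality on the partitioning attributes and that the result is the union of the per-substream windows.

```python
from collections import Counter, defaultdict

def partitioned_window(stream, key, n, t):
    """Return R(t) for S [Partition By ... Rows n].

    `stream` is a list of (tuple, timestamp) pairs; `key(s)` returns the
    values of the partitioning attributes for tuple s.
    """
    substreams = defaultdict(list)
    for s, ts in sorted(stream, key=lambda pair: pair[1]):
        if ts <= t:
            substreams[key(s)].append(s)
    result = Counter()
    for tuples in substreams.values():
        result.update(tuples[-n:])  # tuple-based window on each substream
    return result

# Hypothetical sample data: partition on the first attribute, keep 2 rows each.
stream = [(("x", 1), 1), (("x", 2), 2), (("y", 3), 3), (("x", 4), 4)]
print(partitioned_window(stream, lambda s: s[0], 2, 4))
```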

Page 18: STREAM The Stanford Data Stream Management System

Relation-to-Stream Operators

• 3 Relation-to-stream operators:

• Istream (for Insert Stream)

• Dstream (for Delete Stream)

• Rstream (for Relation Stream)

Page 19: STREAM The Stanford Data Stream Management System

R-to-S Operators

• IS: Istream applied to a relation R contains <s,t> whenever tuple s is in R(t) − R(t − 1)

– i.e., whenever s is inserted into R at time t.

• DS: Dstream applied to a relation R contains <s,t> whenever tuple s is in R(t − 1) − R(t)

– i.e., whenever s is deleted from R at time t.

• RS: Rstream applied to a relation R contains <s,t> whenever tuple s is in R(t)

– i.e., every current tuple in R is streamed at every time instant.
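The three relation-to-stream operators can be sketched with bag differences. This is an illustration under stated assumptions: time is modelled as the integers 0..horizon, the relation is passed as a function `R(t)` returning a `Counter`, and `R(t)` is empty for t < 0. All function names are hypothetical.

```python
from collections import Counter

def istream(R, horizon):
    """<s,t> whenever s is in R(t) - R(t-1) (bag difference)."""
    return [(s, t) for t in range(horizon + 1)
            for s in (R(t) - R(t - 1)).elements()]

def dstream(R, horizon):
    """<s,t> whenever s is in R(t-1) - R(t)."""
    return [(s, t) for t in range(horizon + 1)
            for s in (R(t - 1) - R(t)).elements()]

def rstream(R, horizon):
    """<s,t> for every tuple s in R(t), at every time instant t."""
    return [(s, t) for t in range(horizon + 1) for s in R(t).elements()]

# Hypothetical relation given as one snapshot per time instant 0, 1, 2.
snapshots = [Counter({"a": 1}), Counter({"a": 1, "b": 1}), Counter({"b": 1})]

def R(t):
    return snapshots[t] if 0 <= t < len(snapshots) else Counter()

print(istream(R, 2))  # insertions
print(dstream(R, 2))  # deletions
print(rstream(R, 2))  # full contents at every instant
```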

Page 20: STREAM The Stanford Data Stream Management System

Example 1 CQL Query

• Select Istream(*) From S [Rows Unbounded] Where S.A > 10

– S[Rows Unbounded] (stream-to-relation)

– S.A > 10 (relation-to-relation)

– IStream(*) (relation-to-stream)
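The three steps of this query can be composed in a short Python sketch. It evaluates the abstract semantics directly and is not how STREAM executes the query. Assumptions: time is the integers 0..horizon, each tuple is a Python tuple whose first field is attribute A, and the function name `example1` is hypothetical.

```python
from collections import Counter

def example1(stream, horizon):
    """Evaluate: Select Istream(*) From S [Rows Unbounded] Where S.A > 10."""
    def selected(t):
        if t < 0:
            return Counter()  # the relation is empty before time 0
        # S [Rows Unbounded]: every tuple seen so far (stream-to-relation)
        window = Counter(s for s, ts in stream if ts <= t)
        # S.A > 10: filter the instantaneous relation (relation-to-relation)
        return Counter({s: c for s, c in window.items() if s[0] > 10})

    out = []
    for t in range(horizon + 1):
        # Istream: tuples newly inserted at time t (relation-to-stream)
        out.extend((s, t) for s in (selected(t) - selected(t - 1)).elements())
    return out

# Hypothetical sample stream of (tuple, timestamp) pairs.
S = [((5,), 0), ((12,), 1), ((20,), 3)]
print(example1(S, 3))
```

Because the window is unbounded, nothing is ever deleted from the relation, so the output is simply the input stream filtered by the predicate.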

Page 21: STREAM The Stanford Data Stream Management System

Example 2 CQL Query

• Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

– S1 [Rows 1000] (Stream-to-Relation)

– S2 [Range 2 Minutes] (Stream-to-Relation)

– S1.A = S2.A (Relation-to-relation)

– S1.A > 10 (Relation-to-Relation)

– * (Relation-to-Relation)

Page 22: STREAM The Stanford Data Stream Management System

Example 3 CQL Query

• Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A

– S [Now] (Stream-To-Relation)

– R (Stream-To-Relation)

• assumes [Rows Unbounded]

– S.A = R.A (Relation-To-Relation)

– Rstream(S.A, R.B) (Relation-To-Stream)

Page 23: STREAM The Stanford Data Stream Management System

Query Plans & Execution

• When a continuous query is to be executed within STREAM, a query plan is compiled from it.

• Query plans are composed of:

– Operators (to perform the actual processing)

– Queues (buffer tuples as they move between operators)

– Synopses (which I will not discuss)

Page 24: STREAM The Stanford Data Stream Management System

Operators

• To allow processing, each timestamped tuple is additionally flagged for 'insertion' or 'deletion' (+ or −).

• Streams only include + elements, while relations may include both + and − elements

• Each query plan operator reads from one or more input queues, processes the input based on its semantics, and writes any output to an output queue.
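To illustrate the +/− encoding, the sketch below replays a sequence of flagged elements to recover the instantaneous relation R(t). The element layout (tuple, timestamp, sign) and the function name `relation_at` are assumptions made for this example.

```python
from collections import Counter

def relation_at(elements, t):
    """Rebuild R(t) from (tuple, timestamp, sign) elements, sign in {'+', '-'}."""
    bag = Counter()
    for s, ts, sign in elements:
        if ts <= t:
            bag[s] += 1 if sign == "+" else -1
    return +bag  # unary plus drops tuples whose count fell to zero

# Hypothetical relation: "a" inserted at 1, "b" inserted at 2, "a" deleted at 3.
elements = [("a", 1, "+"), ("b", 2, "+"), ("a", 3, "-")]
print(relation_at(elements, 2))
print(relation_at(elements, 3))
```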

Page 25: STREAM The Stanford Data Stream Management System

Queues

• A queue in a query plan connects its "producing" plan operator O_P to its "consuming" operator O_C. At any time a queue contains a (possibly empty) collection of elements representing a portion of a stream or relation.

• Furthermore, the system requires all queues to enforce non-decreasing timestamps, so that every operator can process its input in timestamp order. (Very Important)
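A minimal sketch of a queue that enforces the non-decreasing timestamp requirement. The class name `TimestampedQueue` and the element layout (tuple, timestamp, sign) are hypothetical. Rejecting an out-of-order element with an exception is a choice made for this example, not something the slides specify.

```python
from collections import deque

class TimestampedQueue:
    """FIFO queue between a producing and a consuming operator."""

    def __init__(self):
        self._items = deque()
        self._last_ts = None  # timestamp of the most recently enqueued element

    def enqueue(self, element):
        _, ts, _ = element
        if self._last_ts is not None and ts < self._last_ts:
            raise ValueError(f"timestamp {ts} is older than {self._last_ts}")
        self._last_ts = ts
        self._items.append(element)

    def dequeue(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)

q = TimestampedQueue()
q.enqueue(("a", 1, "+"))
q.enqueue(("b", 1, "+"))  # equal timestamps are allowed
q.enqueue(("c", 2, "+"))
```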

Page 26: STREAM The Stanford Data Stream Management System

Queue Diagram

Page 27: STREAM The Stanford Data Stream Management System

Query Plan (Example 2)

• Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

Page 28: STREAM The Stanford Data Stream Management System

Query Plan (Example 1)

• Select Istream(*) From S [Rows Unbounded] Where S.A > 10

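• Q1: stream queue (input S)

• SW: seq-window; with [Rows Unbounded], all of Q1 is copied through

• Q2: relation

• Sel: select on S.A > 10

• Q3: relation

• I-S: Istream (relation-to-stream)

• Q4: stream (query output)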

Page 29: STREAM The Stanford Data Stream Management System

Query Plan (Example 3)

• Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A

Page 30: STREAM The Stanford Data Stream Management System

Query Plan Scheduling

• When a query plan is executed, a scheduler selects operators in the plan to execute in turn.

• The semantics of each operator depends only on the timestamps of the elements it processes, not on system or "wall-clock" time.

• Thus, the order of execution has no effect on the data in the query result, although it can affect other properties such as latency and resource utilization.

Page 31: STREAM The Stanford Data Stream Management System

Execution Example

• The first seq-window (now just called SW1) reads (s,τ,+), i.e. tuple s with timestamp τ, flagged as an insertion.

• SW1 stores the tuple in its own buffer.

• If the buffer now holds more than 1000 elements, it removes the oldest element, called s'.

• SW1 writes (s,τ,+) and (s',τ,−) to q3.

• SW2 works similarly.

• Binary-Join (now called BJ3) reads (s,τ,+) from q3.

• Stores it in buffer 1.

• Joins the tuple with all elements of buffer 2.

• Outputs (s·t,τ,+) for each matching tuple t in buffer 2.

Page 32: STREAM The Stanford Data Stream Management System

Execution (Part 2)

• BJ3 processes the elements from all its input queues in non-decreasing timestamp order.

• The Select Operator simply checks its input elements against its predicate and outputs those that pass.
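The execution steps can be sketched as two small Python operators. This is a simplified illustration, not STREAM's implementation: the class names, the `process` methods, and the use of plain Python containers as buffers are all assumptions, and queues and scheduling are left out. Elements are (tuple, timestamp, sign) triples.

```python
from collections import deque

class SeqWindow:
    """Tuple-based window: emits '+' for each arrival, '-' for each eviction."""

    def __init__(self, n):
        self.n = n
        self.buffer = deque()

    def process(self, element):
        s, ts, _ = element
        self.buffer.append(s)
        out = [(s, ts, "+")]
        if len(self.buffer) > self.n:  # buffer overflow: evict the oldest tuple
            out.append((self.buffer.popleft(), ts, "-"))
        return out

class BinaryJoin:
    """Keeps one buffer per input and joins each new element with the other."""

    def __init__(self, predicate):
        self.predicate = predicate  # predicate(left_tuple, right_tuple) -> bool
        self.left, self.right = [], []

    def process(self, element, side):
        s, ts, sign = element
        own, other = (self.left, self.right) if side == "left" else (self.right, self.left)
        if sign == "+":
            own.append(s)
        else:
            own.remove(s)
        if side == "left":
            matches = [(s, t) for t in other if self.predicate(s, t)]
        else:
            matches = [(t, s) for t in other if self.predicate(t, s)]
        # Concatenated tuples inherit the timestamp and sign of the input element.
        return [(l + r, ts, sign) for l, r in matches]

sw = SeqWindow(2)
join = BinaryJoin(lambda l, r: l[0] == r[0])  # join on the first attribute
```

A deletion is handled like an insertion: the joined results are emitted with a − sign, so downstream operators can retract them.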