
Continuous Stream Monitoring Technology

Elke A. Rundensteiner

Database Systems Research Laboratory, Department of Computer Science

Worcester Polytechnic Institute, USA

rundenst@cs.wpi.edu

October 2006

2

Project Topics in a Nutshell

Distributed Data Sources:
- EVE: Data Warehousing over Distributed Data
- TOTAL-ETL: Distributed Extract Transform Load
[NSF’96, NSF02, IBM]

XML/Web Data Systems:
- RAINBOW: XML to Relational Databases
- MASS: Native XQuery Processing System
[Verizon, IBM, NSF05]

Databases & Visualization:
- Scalable Visual High-Dim. Data Exploration
- Data and Visual Quality Support in XMDV
[NSF’97, NSF01, NSF05]

Stream Monitoring System:
- Scalable Query Engine for Data Streams
- Fire Prediction and Monitoring Appl.
[NSF06, NEC]

3

Why Database Technology?

Vast amounts of electronic information in organisations, companies, and scientific institutes need to be organized, stored securely, and accessed efficiently.

Database management systems (DBMSs) provide:
- Model for logical structure of information
- Query languages to access and modify data
- Persistent data storage over long time
- Index technologies
- Efficient query processing and optimization
- Concurrent access for multiple users
- Access rights and security
- Scalability in query workload and data size

Stored Database

DBMS

select name
from employee;

4

Generations of DBMSs

Early DBMSs Navigational access

Relational DBMSs Traditional tables and SQL queries

Object-oriented DBMSs Object modeling and extensibility

Object-relational DBMSs Combine declarative queries with OO modeling

XML DBMSs Support web and semi-structured data types

5

Question . . . ?

What is common among these DBMSs ?

Stored Database

DBMS


6

Answer . . .

Three common steps:
- Make schema design
- Load database
- Query static database

Key Differences: Different data models

Stored Database

DBMS


7

So what next ?

Stored Database

DBMS


8

A Look at Modern Applications

- Digital radio telescopes
- Network traffic monitoring
- Environmental monitoring
- Tracking using RFID tags
- Sensor networks
- Analyses of web usage logs
- Financial analysis of stock exchanges
- Out-patient critical care
- . . .

Filter & Transform

select fft(s)
from radiosignal s
where source(s) = "Antenna1";

9

A Look at Modern Applications

What do those applications have in common?

Filter & Transform

select fft(s)
from radiosignal s
where source(s) = "Antenna1";

10

Continuous Queries on Data Streams

Online Stream Monitoring

11

Databases : A Paradigm Shift !

[Figure: ad-hoc one-time queries run against static data, versus continuous standing queries evaluated over streams of data]

12

Data Streams and Continuous Queries

Data streams:
- Continuous on-line ordered sequences
- Produced by sensors, simulations, and instruments
- Data pushed to reactive applications
- Results are also continuous output streams

Stream queries:
- Continuous long-running or even infinite queries
- On-the-fly real-time processing as data arrives
- Constrained processing time and memory usage
- Selective stream storage (often of recent past)

13

Requirements for Data Stream Management Systems (DSMSs)

Non-blocking operators in query plans

Windows: Infinite streams into finite sub-streams

One-pass query algorithms

Approximate query answers

Real-time response for unusual behavior detected

Adaptation to environmental changes
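The first three requirements above (non-blocking operators, windows, one-pass algorithms) can be illustrated with a minimal sketch. The class name `SlidingWindowAverage`, the time-based window semantics, and the sample readings are assumptions made for illustration; they are not taken from any particular DSMS.

```python
from collections import deque


class SlidingWindowAverage:
    """Time-based sliding window that maintains an average in one pass."""

    def __init__(self, window_size):
        self.window_size = window_size  # window length in time units (assumed)
        self.items = deque()            # (timestamp, value) pairs still in the window
        self.total = 0.0                # running sum of the values in the window

    def insert(self, timestamp, value):
        # Expire tuples that have fallen out of the window.
        while self.items and self.items[0][0] <= timestamp - self.window_size:
            _, old_value = self.items.popleft()
            self.total -= old_value
        self.items.append((timestamp, value))
        self.total += value
        # Non-blocking: a result is emitted per arrival, not at end of stream.
        return self.total / len(self.items)


# Hypothetical temperature readings as (timestamp, value) pairs.
readings = [(1, 20.0), (2, 22.0), (3, 24.0), (7, 30.0)]
window = SlidingWindowAverage(window_size=5)
averages = [window.insert(ts, v) for ts, v in readings]
```

Each tuple is touched once on arrival and once on expiry, so memory is bounded by the window contents rather than by the length of the stream.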

14

DSMS Provides:

- High-level query language (declarative interface)
- Data independence from physical stream implementations
- Query optimization (for performance)
- Scalability in data volume and query workload
- Shared execution of similar queries
- Adaptive distributed processing

15

Real-time Stream Query Processing: Parallelism

Process queries on shared-nothing architectures (cluster or Grid)

Make use of aggregated resources (main memory, CPU)

[Figure: a query workload distributed over clusters of machines connected by a network]

Acquired NSF Equipment grant 2006 for Purchase of High-Performance Cluster For Stream Processing Applications

16

Three Types of Parallelism We Exploit

Pipelined: Operators are composed into producer-consumer relationships

Independent: Independent operators run simultaneously on distinct machines

Partitioned: A single operator is replicated and run on multiple machines

Adaptation Considered Within Each Processing Paradigm
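As a rough illustration of partitioned parallelism, the sketch below replicates one join operator and routes tuples by join key, so that tuples that can join always meet on the same replica. The names `JoinWorker` and `run_partitioned_join` are hypothetical, integer join keys are assumed, and the "machines" are simulated in a single process.

```python
class JoinWorker:
    """One replica of an equi-join operator, holding only its own partition."""

    def __init__(self):
        self.left = {}   # join key -> payloads seen on the left input
        self.right = {}  # join key -> payloads seen on the right input

    def process(self, side, key, payload):
        if side == "L":
            self.left.setdefault(key, []).append(payload)
            return [(payload, r) for r in self.right.get(key, [])]
        self.right.setdefault(key, []).append(payload)
        return [(l, payload) for l in self.left.get(key, [])]


def run_partitioned_join(stream, num_workers):
    """Route each tuple to the replica that owns its key (simulated machines)."""
    workers = [JoinWorker() for _ in range(num_workers)]
    results = []
    for side, key, payload in stream:
        owner = workers[key % num_workers]  # assumes integer join keys
        results.extend(owner.process(side, key, payload))
    return results


# Hypothetical input: (input side, join key, payload).
stream = [("L", 1, "a1"), ("R", 1, "b1"), ("R", 2, "b2"), ("L", 2, "a2"), ("L", 1, "a3")]
partitioned = run_partitioned_join(stream, num_workers=3)
single = run_partitioned_join(stream, num_workers=1)
```

Because routing depends only on the key, the partitioned run produces the same join results as a single replica while each replica stores only a fraction of the state.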

18

Scuba Project : Mobile Application Streams

Scalability:
- Large number of objects
- Large number of queries

Limited Resources:
- Memory
- CPU

Real-time Response Requirement

The challenge is to provide fast query response in update-intensive environments

[Figure legend: moving objects, dynamic range query, dynamic kNN query]

Novel Idea: Exploit the fact that objects naturally move in groups (i.e., clusters) to optimize query evaluation

19

Spatio-Temporal Continuous Tracking

Monitor the traffic in the red areas

Continuously return the area covered by the herd during the migration

20

Main Idea: Moving Clusters

Main Idea: Abstracting individual objects into a cluster based on common attributes

- Direction

- Speed

- Spatial Position

With cluster abstractions, minimize the number of unnecessary individual object/query joins, thus optimizing query evaluation

Continuously retrieve closest police car next to me

Police Car

Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects (SCUBA)

21

Advantage of Moving Cluster Abstraction

When clusters don’t overlap, we avoid many joins of individual objects within those clusters

[Figure: two non-overlapping moving clusters, m1 and m2. Legend: moving object, spatio-temporal range query]

No need to join objects/queries in m1 with queries/objects in m2

Scuba presented April 2006 at EDBT’06

If two abstractions do not "overlap", then we can discard negative candidates and avoid individual joins for spatio-temporal range queries.
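The pruning idea can be sketched as a cluster-level filter applied before any object/query join. This is a simplified illustration, not the SCUBA algorithm itself: clusters are assumed to be circles (centroid plus radius) that fully enclose their member objects and query rectangles, and the names `MovingCluster`, `clusters_overlap`, and `evaluate` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MovingCluster:
    cx: float
    cy: float
    radius: float  # assumed to enclose all member objects and query regions
    objects: list = field(default_factory=list)  # (id, x, y)
    queries: list = field(default_factory=list)  # (id, xmin, ymin, xmax, ymax)


def clusters_overlap(a, b):
    # Two circles overlap when their centres are closer than the sum of radii.
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= a.radius + b.radius


def evaluate(clusters):
    """Join objects with range queries, skipping non-overlapping cluster pairs."""
    matches, joins = [], 0
    for a in clusters:
        for b in clusters:
            if a is not b and not clusters_overlap(a, b):
                continue  # cluster-level filter: no individual joins needed
            for oid, x, y in a.objects:
                for qid, xmin, ymin, xmax, ymax in b.queries:
                    joins += 1
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        matches.append((oid, qid))
    return matches, joins


m1 = MovingCluster(0, 0, 2, objects=[("o1", 0.5, 0.5)], queries=[("q1", 0, 0, 1, 1)])
m2 = MovingCluster(10, 10, 2, objects=[("o2", 10, 10)], queries=[("q2", 9, 9, 11, 11)])
matches, joins = evaluate([m1, m2])
```

In this toy input the two clusters are far apart, so only 2 of the 4 possible object/query comparisons are performed.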

Raindrop : XQueries on XML Streams (or, Automaton Meets Algebra)

Funded by NSF 2005;

In collaboration with Prof. Mani

24

What’s Special for XML Stream Processing?

<Biditems>

<book year="2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 30 </price>

</book>

…

<biditems> <book> <title> Dream Catcher </title> …

Token-by-Token access manner

timeline

Pattern retrieval + Filtering + Restructuring

FOR $b in stream(biditems.xml)//book
LET $p := $b/price,
    $t := $b/title
WHERE $p < 20
RETURN <Inexpensive> $t </Inexpensive>

Token: not a direct counterpart of a tuple

year | title | last | first | publisher | price
2001 | Dream | King | S. | Bt Bound | 30

Pattern Retrieval on Token Streams

25

Automata-Based Paradigm


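[Figure: automaton over the token stream. State 1 has a * self-loop and moves to state 2 on book (//book); state 2 moves to state 4 on title (//book/title) and to state 3 on price (//book/price).]

Auxiliary structures for:

1. Buffering data

2. Filtering

3. Restructuring

…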
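A minimal sketch of automaton-based pattern retrieval over a token stream follows. The state numbers mirror the labels on the slide; the token encoding as `(kind, value)` pairs, the dead state 0, the `extract_books` function, and the dictionary used as a buffer are assumptions made for illustration, not the Raindrop implementation. Nested book elements are not handled.

```python
# Transitions of the automaton: state -> {element name -> next state}.
TRANSITIONS = {
    1: {"book": 2},               # //book
    2: {"title": 4, "price": 3},  # //book/title and //book/price
}
SELF_LOOP = {1}  # state 1 stays in state 1 on any other element (the * edge)
DEAD = 0         # assumed sink state for elements that match no pattern


def extract_books(tokens):
    """Run the automaton over (kind, value) tokens and buffer matched data."""
    stack = [1]   # one automaton state per open element
    current = {}  # auxiliary buffer for the book being read
    books = []
    for kind, value in tokens:
        if kind == "start":
            state = stack[-1]
            nxt = TRANSITIONS.get(state, {}).get(value)
            if nxt is None:
                nxt = 1 if state in SELF_LOOP else DEAD
            if nxt == 2:
                current = {}  # a new book begins: reset the buffer
            stack.append(nxt)
        elif kind == "text":
            if stack[-1] == 4:
                current["title"] = value.strip()
            elif stack[-1] == 3:
                current["price"] = value.strip()
        elif kind == "end":
            if stack.pop() == 2:
                books.append(current)  # book closed: emit one tuple
    return books


tokens = [
    ("start", "biditems"), ("start", "book"),
    ("start", "title"), ("text", " Dream Catcher "), ("end", "title"),
    ("start", "author"), ("start", "last"), ("text", "King"), ("end", "last"), ("end", "author"),
    ("start", "price"), ("text", " 30 "), ("end", "price"),
    ("end", "book"), ("end", "biditems"),
]
books = extract_books(tokens)
```

The buffer stands in for the auxiliary structures listed above; filtering and restructuring would be applied to the emitted tuples.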

26

Observations

Either paradigm has deficiencies

Both paradigms complement each other

Automata Paradigm | Algebra Paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Need patches for filtering and restructuring | Good for filtering and restructuring
Present all details on same low level | Support multiple descriptive levels (declarative -> procedural)
Little studied as query processing paradigm | Well studied as query processing paradigm

27

Towards One Uniform Algebraic View

[Figure: Algebraic Stream Plan. The XML data stream feeds a token-based plan (automata plan), which passes a tuple stream to a tuple-based plan that produces the query answer.]

28

Example Algebraic Plan


Tuple-based plan


29

Example Uniform Algebraic Plan


StructuralJoin $b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Tuple-based plan

30

Example Uniform Algebraic Plan


StructuralJoin $b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Select $p<30

Tagger “Inexpensive”, $t->$r

31

Plan Rewriting : In or Out?


Tuple-based Plan

Tuple stream

XML data stream

Query answer

Pattern retrieval in Semantics-focused plan

Apply “push into automata”

32

Raindrop Plan Alternatives

Nav $b, /price->$p

ExtractNest $b, $p

ExtractNest $b, $t

SJoin //book

Select price < 30

Tagger

Nav $b, /title->$t

Nav $S1, //book->$b

ExtractNest $S1, $b

Navigate /price

Select price<30

Navigate book/title

Tagger

Nav $S1, //book->$b

NavUnnest $S1, //book ->$b

NavNest $b, /price ->$p

NavNest $b, /title ->$t

Select $p<30

Tagger “Inexpensive”, $t->$r

Out In

Statistics Collection and On-line Plan Migration

33

Raindrop : Research Contributions and Issues

- Costing/query optimization of plans
- On-the-fly migration into/out of automaton
- Physical implementation strategies of operators
- Exploit XML schema constraints for query optimization
- Load-shedding from an automaton
- Early memory release optimization

Published in CIKM’03, ER’03, DKE’06 Journal, VLDB’05, VLDB’06.

34

FireEngine Project : Sensors in Buildings

35

Fire Monitoring Queries

Ambient Queries: What are the typical temperature and humidity in given rooms, based on the environment?

Detection Queries: Are unusual behaviors or patterns detected?

Tracking Queries: Track smoke and heat clouds (moving clusters) in terms of their sizes and speeds.

Analysis Queries: Is there an outlier (prank), or an actual fire?

Reliability Assessment: Are any sensors faulty, and thus to be ignored?

Prediction Queries: Match sensor readings of a fire against a fire stream simulation to determine similarity.

FireStream Demo to be presented at ICDE’07

36

Project : RFID Event Stream Monitoring

Given potentially infinite, heterogeneous, high-speed event streams

Goal: detect interesting patterns among events
- Supply chain management, e.g., (“insufficient inventory”→“no-backup”) or “inventory overflow”
- Business service optimization, e.g., “search ticket”→“timeout”
- Anomaly detection, e.g., “pick item”→“no checkout”→“exit”
- And more …

Complex query patterns to be answered in real-time

Supported by NEC Cupertino and NSF Princeton

37

Event Processing Example

Event stream: pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), …

Event Pattern Query:
EVENT SEQ(PICK p, !(CHECKOUT c), EXIT e)
WHERE p.id=c.id AND c.id=e.id
WITHIN 12 hours

Processing:
- Sequence scan & construction: (p, e) pairs
- Selection: apply predicates
- Window: check time constraints
- Negation: check for negation
- Transformation: make complex output event
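The example pattern can be evaluated with a small sketch like the one below. It fuses the listed processing steps into one incremental pass rather than running them as separate operators, and it is not the engine described in the talk. The function name `detect_pick_without_checkout`, the `(timestamp, type, id)` event encoding, and the hour-valued timestamps are assumptions; the slide gives no timestamps.

```python
def detect_pick_without_checkout(events, window):
    """Match SEQ(PICK p, !(CHECKOUT c), EXIT e) on equal ids within `window`.

    `events` is an iterable of (timestamp, event_type, item_id) in time order.
    """
    picked = {}  # item_id -> pick timestamp, still awaiting checkout or exit
    alerts = []
    for ts, event_type, item_id in events:
        if event_type == "pick":
            picked[item_id] = ts
        elif event_type == "checkout":
            # Negation: a checkout between pick and exit cancels the match.
            picked.pop(item_id, None)
        elif event_type == "exit":
            pick_ts = picked.pop(item_id, None)
            # Selection (same id) and window (time constraint) checks.
            if pick_ts is not None and ts - pick_ts <= window:
                # Transformation: build the complex output event.
                alerts.append({"id": item_id, "picked_at": pick_ts, "exited_at": ts})
    return alerts


# The slide's event stream, with hypothetical timestamps in hours.
events = [
    (0, "pick", 1), (1, "pick", 2), (2, "pick", 3),
    (3, "checkout", 3), (4, "pick", 4), (5, "exit", 2),
]
alerts = detect_pick_without_checkout(events, window=12)
```

Only item 2 is reported: it was picked and left without an intervening checkout, while item 3 was checked out and items 1 and 4 have not exited yet.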

Time

38

Challenges for High-Performance Processing

Use “Workflows” to Early Terminate Pattern Queries

Optimize Event Pattern Queries Using Rewriting

Prefix Sharing of Multiple Event Pattern Queries

Scalable Processing Using Cluster

39

CAPE: Uncertainties in Stream Query Processing

[Figure: continuous queries are registered with a Scalable Stream Query Engine, which consumes streaming data (push-based paradigm), evaluates the queries continuously, and emits a streaming result]

- Streaming data may have time-varying rates and high volumes
- High workload of queries
- Memory and CPU resource limitations
- Available resources for executing each operator may vary over time
- Real-time and accurate responses required
- Distribution and adaptations are required

40

CAPE : Continuous Adaptive Processing Engine -- Adaptation at all Layers

- Reactive Operator Algorithms
- Adaptive Scheduling of Operators
- On-Line Query Plan Reshaping
- Multi-Query Pipeline Sharing
- Synchronized Data Tree Spilling
- Adaptive Cluster-Driven Load Shedding
- Dynamic Workload Distribution over Cluster
- Data-Partitioning for Parallel Stream Processing

41

Adaptation Techniques in CAPE

On-Line Query Plan Reshaping

(with Yali Zhu and G. Heineman )

Published in ACM SIGMOD 2004, and in submission to the TODS journal

42

Run-time Plan Re-Optimization

Step 1 - Decide when to optimize: statistics monitoring

Step 2 - Generate new query plan: query optimization

Step 3 - Replace current plan by new plan: plan migration

43

Naïve Plan Migration Strategy

Migration Steps:
- Pause execution of old plan
- Drain out all tuples inside old plan
- Replace old plan by new plan
- Resume execution of new plan

[Figure: old plan and new plan, each built from join operators AB and BC over input streams A, B, and C]

Problem: Works for stateless operators only

44

Stateful Operator in CQ

Why stateful?
- Need non-blocking operators in CQ
- Operator needs to output partial results

[Figure: join operator AB over inputs A and B, with State A and State B]

Symmetric hash join: for each new tuple from A, purge state B, join with state B, insert into state A

Key Observation: The purge of tuples in states relies on processing of new tuples.
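The purge/join/insert cycle of a windowed symmetric hash join can be sketched as follows. The class names `JoinState` and `SymmetricHashJoin`, the time-based window, and the assumption that timestamps arrive in non-decreasing order are illustrative choices, not details from the talk.

```python
from collections import defaultdict, deque


class JoinState:
    """State of one join input: a hash table for probing, a queue for purging."""

    def __init__(self):
        self.by_time = deque()            # (timestamp, key) in arrival order
        self.by_key = defaultdict(deque)  # key -> (timestamp, payload), oldest first

    def purge(self, cutoff):
        # Drop every tuple whose timestamp is at or before the cutoff.
        while self.by_time and self.by_time[0][0] <= cutoff:
            _, key = self.by_time.popleft()
            bucket = self.by_key[key]
            bucket.popleft()
            if not bucket:
                del self.by_key[key]

    def probe(self, key):
        return [payload for _, payload in self.by_key.get(key, ())]

    def insert(self, timestamp, key, payload):
        self.by_time.append((timestamp, key))
        self.by_key[key].append((timestamp, payload))

    def __len__(self):
        return len(self.by_time)


class SymmetricHashJoin:
    def __init__(self, window):
        self.window = window
        self.states = {"A": JoinState(), "B": JoinState()}

    def process(self, source, timestamp, key, payload):
        own = self.states[source]
        other = self.states["B" if source == "A" else "A"]
        other.purge(timestamp - self.window)  # 1. purge the opposite state
        matches = other.probe(key)            # 2. join with the opposite state
        own.insert(timestamp, key, payload)   # 3. insert into own state
        if source == "A":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]


join = SymmetricHashJoin(window=10)
first = join.process("A", 1, "k", "a1")
second = join.process("B", 2, "k", "b1")
third = join.process("B", 20, "k", "b2")
```

After the third call, state A is empty but state B still holds the expired tuple b1, because no new A tuple has arrived to trigger its purge. That is the behavior the key observation refers to.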

45

Naïve Migration Strategy Revisited

Steps:
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan

[Figure: plan with join operators AB and BC over inputs A, B, and C, annotated: (2) all tuples drained; (3) old replaced by new; (4) processing resumed]

Deadlock Waiting Problem

46

Proposed Dynamic Migration Strategies

- Moving State Strategy
- Parallel Track Strategy

47

Moving State Strategy

Basic idea: Share common states between two boxes

Key Steps:
- State matching: identify common states
- State moving: share common states
- State recomputing: recompute unmatched states

48

Moving State Strategy

State Matching:
- State in old box has unique ID
- During rewriting, new ID given to newly generated state in new box
- When rewriting done, match states based on IDs

State Moving:
- Between matched states
- On same machine, creates new pointers for matched states in new box

What’s left? Unmatched states in new box

[Figure: Old Box with joins AB (states SA, SB), BC (states SAB, SC), and CD (states SABC, SD); New Box with joins BC (states SB, SC), CD (states SBC, SD), and AB (states SA, SBCD). Both boxes read input queues QA, QB, QC, QD and write output queue QABCD.]

49

Unmatched States

State Recomputing:
- Recursively recompute unmatched SBC and SBCD by joining matched states

Why always possible?
- Old and new boxes have same input queues
- The states associated with input queues always match

Why necessary?

[Figure: New Box with joins BC (states SB, SC), CD (states SBC, SD), and AB (states SA, SBCD), input queues QA, QB, QC, QD, and output queue QABCD]
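The three steps of the moving state strategy (matching, moving, recomputing) can be sketched on the ABCD example. This is a simplified illustration under stated assumptions: states are plain lists of dictionaries, all joins are equi-joins on a shared field `k`, and the names `join`, `migrate_moving_state`, and `recipes` are hypothetical.

```python
def join(left, right):
    """Equi-join two states on the shared field "k" (an assumed join condition)."""
    return [{**l, **r} for l in left for r in right if l["k"] == r["k"]]


def migrate_moving_state(old_states, new_state_ids, recipes):
    """Build the new box's states from the old box's states.

    `recipes` maps an intermediate state ID to the two states it is the join of.
    """
    new_states = {}
    # State matching and moving: reuse matched states by reference (no copying).
    for state_id in new_state_ids:
        if state_id in old_states:
            new_states[state_id] = old_states[state_id]

    # State recomputing: recursively rebuild unmatched states from matched ones.
    def recompute(state_id):
        if state_id not in new_states:
            left_id, right_id = recipes[state_id]
            new_states[state_id] = join(recompute(left_id), recompute(right_id))
        return new_states[state_id]

    for state_id in new_state_ids:
        recompute(state_id)
    return new_states


# Old box states for ((A join B) join C) join D, with toy contents.
old_states = {
    "A": [{"k": 1, "a": "a1"}],
    "B": [{"k": 1, "b": "b1"}, {"k": 2, "b": "b2"}],
    "C": [{"k": 1, "c": "c1"}],
    "D": [{"k": 1, "d": "d1"}],
    "AB": [{"k": 1, "a": "a1", "b": "b1"}],
    "ABC": [{"k": 1, "a": "a1", "b": "b1", "c": "c1"}],
}
new_states = migrate_moving_state(
    old_states,
    new_state_ids=["A", "B", "C", "D", "BC", "BCD"],
    recipes={"BC": ("B", "C"), "BCD": ("BC", "D")},
)
```

The recursion always terminates at the states attached to the input queues, which exist in both boxes. That is why recomputation is always possible.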

50

MS Migration Pros and Cons

Pros:
- Fast when # of tuples in states is small (low input rates or small window size)

Cons:
- Output silence during entire migration stage
- Can we output results even during migration?

Motivation for Parallel Track Strategy

51

Parallel Track Strategy

Basic idea:
- Execute both old and new plans in parallel
- Gradually “push” old tuples out of old box by purging

Key Steps:
- Connect new box
- Execute both boxes in parallel
- Remove old box once “expired”: it contains only new tuples, no old tuples or sub-tuples

52

Parallel Track Strategy

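- Connect boxes
- Execute in parallel until all old tuples purged
- Disconnect old box

[Figure: Old Box (joins AB with states SA, SB; BC with SAB, SC; CD with SABC, SD) and New Box (joins BC with states SB, SC; CD with SBC, SD; AB with SA, SBCD) running in parallel on input queues QA, QB, QC, QD and writing to QABCD. Callout: a tuple ABC in SABC consists of sub-tuples A, B, and C.]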

53

PT Migrations Pros and Cons

Pros:
- Keeps producing results even during migration (no results during MS migration)

Cons:
- Migration duration is at least 2W
- MS may be faster, depending on # of tuples in states

54

Summary : Stream Plan Migration

- First run-time solution for stateful operators
- Two migration methods: Moving State Strategy and Parallel Track Strategy
- Cost Models and Experimental Evaluations

What next?
- Scope of optimization?
- Support of other stateful operators?
- Migration in distributed stream systems?

55

Overall Summary : So Much Left to Do !

Large variety of challenging stream applications

Generic core technology for stream processing engines

Our central theme : Optimization via Adaptation

- Part I: Plan migration
- Part II: Plan distribution
- Part III: Plan-level spill

Many open questions remain . . .

56

Thank You For Your Patience !

The End

57

Acknowledgments

All the students (Ph.D., MS, and undergraduate) in the DSRG lab who have contributed to this research project directly or indirectly.

Most notably: Luping Ding, Yali Zhu, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Venky Raghavan, Song Wang, Natasha Bogdanova, Mingzhu Wei, Ming Li, and others.

To the National Science Foundation for partial support via IDM grants, to WPI for the RDC grant, and to IBM and NEC.

58

Selected CAPE Publications and Reports

[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.

[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.

[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.

[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.

[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.

[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration Paper. VLDB 2004.

[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18, 2004.

[SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005.

[SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005.

[LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005.

[B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report, WPI-CS-TR-05, 2005. (in submission)

CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html
