Post on 20-Dec-2015
Continuous Stream Monitoring Technology
Elke A. Rundensteiner
Database Systems Research Laboratory
Department of Computer Science
Worcester Polytechnic Institute, USA
rundenst@cs.wpi.edu
October 2006
2
Project Topics in a Nutshell
Distributed Data Sources:
- EVE: Data Warehousing over Distributed Data
- TOTAL-ETL: Distributed Extract Transform Load
[NSF’96, NSF02, IBM]
XML/Web Data Systems:
- RAINBOW: XML to Relational Databases
- MASS: Native XQuery Processing System
[Verizon, IBM, NSF05]
Databases & Visualization:
- Scalable Visual High-Dim. Data Exploration
- Data and Visual Quality Support in XMDV
[NSF’97, NSF01, NSF05]
Stream Monitoring System:
- Scalable Query Engine for Data Streams
- Fire Prediction and Monitoring Appl.
[NSF06, NEC]
3
Why Database Technology?
Vast amounts of electronic information in organisations, companies, and scientific institutes need to be organized, stored securely, and accessed efficiently
Database management systems (DBMSs) provide:
- Model for logical structure of information
- Query languages to access and modify data
- Persistent data storage over long time
- Index technologies
- Efficient query processing and optimization
- Concurrent access for multiple users
- Access rights and security
- Scalability in query workload and data size
Stored Database
DBMS
Select name from employee;
4
Generations of DBMSs
Early DBMSs: Navigational access
Relational DBMSs: Traditional tables and SQL queries
Object-oriented DBMSs: Object modeling and extensibility
Object-relational DBMSs: Combine declarative queries with OO modeling
XML DBMSs: Support web and semi-structured data types
5
Question . . . ?
What is common among these DBMSs ?
Stored Database
DBMS
Select name from employee;
6
Answer . . .
Three common steps:
- Make schema design
- Load database
- Query static database
Key Differences: Different data models
Stored Database
DBMS
Select name from employee;
8
A Look at Modern Applications
- Digital radio telescopes
- Network traffic monitoring
- Environmental monitoring
- Tracking using RFID tags
- Sensor networks
- Analyses of web usage logs
- Financial analysis of stock exchanges
- Out-patient critical care
- . . .
Filter & Transform
select fft(s)
from radiosignal s
where source(s) = “Antenna1”;
9
A Look at Modern Applications
What do those applications have in common?
Filter & Transform
select fft(s)
from radiosignal s
where source(s) = “Antenna1”;
11
Databases : A Paradigm Shift !
[Figure: in a traditional DBMS, ad-hoc one-time queries are run against static data; in a stream system, streams of data flow through continuous standing queries]
12
Data Streams and Continuous Queries
Data streams:
- Continuous on-line ordered sequences
- Produced by sensors, simulations, and instruments
- Data pushed to reactive applications
- Results are also continuous output streams
Stream queries:
- Continuous long-running or even infinite queries
- On-the-fly real-time processing as data arrives
- Constrained processing time and memory usage
- Selective stream storage (often of recent past)
13
Requirements for Data Stream Management Systems (DSMSs)
Non-blocking operators in query plans
Windows: Infinite streams into finite sub-streams
One-pass query algorithms
Approximate query answers
Real-time response for unusual behavior detected
Adaptation to environmental changes
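The window and one-pass requirements above can be illustrated with a minimal sketch. It assumes a time-based sliding window and an average aggregate; the class name, the `window` parameter, and the sample readings are hypothetical and not taken from the slides.

```python
from collections import deque

class SlidingWindowAverage:
    """One-pass average over the most recent `window` time units of a stream."""

    def __init__(self, window):
        self.window = window
        self.items = deque()   # (timestamp, value) pairs still inside the window
        self.total = 0.0       # running sum, so no tuple is ever re-read

    def insert(self, timestamp, value):
        # Add the new reading, then purge readings that fell out of the window.
        self.items.append((timestamp, value))
        self.total += value
        while self.items[0][0] <= timestamp - self.window:
            _, old = self.items.popleft()
            self.total -= old
        # Non-blocking: a result is emitted for every arriving tuple.
        return self.total / len(self.items)

avg = SlidingWindowAverage(window=10)
results = [avg.insert(t, v) for t, v in [(0, 20.0), (5, 22.0), (12, 30.0)]]
```

Each tuple is touched once on arrival and once on expiry, and memory is bounded by the window contents rather than by the length of the stream.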
14
DSMS Provides:
- High-level query language (declarative interface)
- Data independence from physical stream implementations
- Query optimization (for performance)
- Scalability in data volume and query workload
- Shared execution of similar queries
- Adaptive distributed processing
15
Real-time Stream Query Processing: Parallelism
Process queries on shared-nothing architectures (cluster or Grid)
Make use of aggregated resources (main memory, CPU)
[Figure: query workload distributed over clusters of machines connected by a network]
Acquired NSF Equipment grant 2006 for purchase of a high-performance cluster for stream processing applications
16
Three Types of Parallelism We Exploit
Pipelined: Operators are composed into a producer-consumer relationship
Independent: Independent operators run simultaneously on distinct machines
Partitioned: A single operator is replicated and run on multiple machines
Adaptation Considered Within Each Processing Paradigm
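As a rough illustration of partitioned parallelism, the sketch below routes tuples to replicas of one operator by hashing a key. The slides do not say how the data is partitioned, so hash routing, the `partition` helper, and the sample sensor readings are all assumptions.

```python
def partition(tuples, key, num_machines):
    """Route each tuple to one replica of the operator by hashing its key."""
    buckets = [[] for _ in range(num_machines)]
    for t in tuples:
        # Tuples with the same key always reach the same replica,
        # so each replica can evaluate its share independently.
        buckets[hash(t[key]) % num_machines].append(t)
    return buckets

# Hypothetical sensor readings keyed by an integer sensor id.
readings = [{"sensor": i, "temp": 20 + i} for i in range(1, 7)]
buckets = partition(readings, "sensor", 3)
```

Integer keys are used here because Python hashes small integers to themselves, which keeps the routing deterministic across runs.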
18
Scuba Project : Mobile Application Streams
Scalability
- Large number of objects
- Large number of queries
Limited Resources
- Memory
- CPU
Real-time Response Requirement
The challenge is to provide fast query response in update-intensive environments
[Figure legend: moving objects, dynamic range query, dynamic kNN query]
Novel Idea: Exploit the fact that objects naturally move in groups (i.e., clusters) to optimize query evaluation
19
Spatio-Temporal Continuous Tracking
Monitor the traffic in the red areas
Continuously return the area covered by the herd during the migration
20
Main Idea: Moving Clusters
Main Idea: Abstracting individual objects into a cluster based on common attributes
- Direction
- Speed
- Spatial Position
With cluster abstractions, minimize the number of unnecessary individual object/query joins, thus optimizing query evaluation
Continuously retrieve closest police car next to me
[Figure label: Police Car]
Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects (SCUBA)
21
Advantage of Moving Cluster Abstraction
When clusters don’t overlap, we avoid many joins of individual objects within those clusters
[Figure: two non-overlapping moving clusters, m1 and m2; legend: moving object, spatio-temporal range query]
No need to join objects/queries in m1 with queries/objects in m2
Scuba presented April 2006 at EDBT’06
If two abstractions do not “overlap”, then we can discard negative candidates and avoid individual joins for spatio-temporal range queries.
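The pruning idea can be sketched as follows. This is an illustration only: it assumes each moving cluster is summarized by a circle (centroid plus radius), and the `Cluster` class, `overlaps`, `candidate_pairs`, and the sample data are hypothetical names, not the SCUBA implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Cluster:
    cx: float                    # centroid x
    cy: float                    # centroid y
    radius: float
    members: list = field(default_factory=list)   # object or query ids

def overlaps(a, b):
    """Two circular clusters overlap if their centroids are close enough."""
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= a.radius + b.radius

def candidate_pairs(object_clusters, query_clusters):
    """Join individual members only for cluster pairs that overlap."""
    pairs = []
    for oc in object_clusters:
        for qc in query_clusters:
            if overlaps(oc, qc):   # otherwise the whole pair is discarded
                pairs.extend((o, q) for o in oc.members for q in qc.members)
    return pairs

m1 = Cluster(0.0, 0.0, 1.0, ["o1", "o2"])   # cluster of moving objects
m2 = Cluster(10.0, 0.0, 1.0, ["q1"])        # far away: pruned as a whole
m3 = Cluster(1.0, 0.0, 1.0, ["q2"])         # overlaps m1: members are joined
pairs = candidate_pairs([m1], [m2, m3])
```

The surviving pairs are still only candidates; each one would need an exact object/query check afterwards.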
Raindrop : XQueries on XML Streams (or, Automaton Meets Algebra)
Funded by NSF 2005;
In collaboration with Prof. Mani
24
What’s Special for XML Stream Processing?
<biditems>
<book year="2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound</publisher>
<price>30</price>
</book>
…
Token-by-token access along the timeline:
<biditems> <book> <title> Dream Catcher </title> …
Pattern retrieval + Filtering + Restructuring
FOR $b in stream(biditems.xml)//book
LET $p := $b/price
    $t := $b/title
WHERE $p < 20
Return <Inexpensive> $t </Inexpensive>
Token: not a direct counterpart of a tuple
year   title   last   first   publisher   price
2001   Dream   King   S.      Bt Bound    30
Pattern Retrieval on Token Streams
25
Automata-Based Paradigm
FOR $b in stream(biditems.xml)//book
LET $p := $b/price
    $t := $b/title
WHERE $p < 20
Return <Inexpensive> $t </Inexpensive>
[Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price; its states recognize //book, //book/title, and //book/price]
Auxiliary structures for:
1. Buffering data
2. Filtering
3. Restructuring
…
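To make the token-based view concrete, here is a minimal sketch of evaluating the example query directly over a token stream. It is not the Raindrop automaton: the token format `(kind, value)`, the function name `inexpensive_titles`, and the second book in the sample data are assumptions made for illustration. A stack of open element names stands in for the automaton state, and two variables play the role of the auxiliary buffering structures.

```python
def inexpensive_titles(tokens, limit):
    """Return titles of books whose price is below `limit`, in one pass."""
    path = []                 # stack of open elements (the "automaton state")
    text = []                 # text buffered for the current element
    title = price = None      # auxiliary buffers for the current book
    results = []
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
            text = []
            if value == "book":
                title = price = None
        elif kind == "text":
            text.append(value)
        elif kind == "end":
            if len(path) >= 2 and path[-2] == "book":
                # Pattern retrieval: direct children of a book element.
                if value == "title":
                    title = "".join(text).strip()
                elif value == "price":
                    price = float("".join(text))
            elif value == "book" and price is not None and price < limit:
                # Filtering and restructuring happen once the book closes.
                results.append(title)
            path.pop()
    return results

tokens = [
    ("start", "biditems"),
    ("start", "book"), ("start", "title"), ("text", "Dream Catcher"),
    ("end", "title"), ("start", "price"), ("text", "30"), ("end", "price"),
    ("end", "book"),
    ("start", "book"), ("start", "title"), ("text", "Cheap Read"),
    ("end", "title"), ("start", "price"), ("text", "15"), ("end", "price"),
    ("end", "book"),
]
cheap = inexpensive_titles(tokens, 20)
```

Note that the title has to be buffered until the price token arrives, which is why pattern retrieval alone is not enough and extra structures are needed for filtering and restructuring.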
26
Observations
Either paradigm has deficiencies
Both paradigms complement each other
Automata Paradigm                               Algebra Paradigm
Good for pattern retrieval on tokens            Does not support token inputs
Need patches for filtering and restructuring    Good for filtering and restructuring
Present all details on same low level           Support multiple descriptive levels (declarative->procedural)
Little studied as query processing paradigm     Well studied as query processing paradigm
27
Towards One Uniform Algebraic View
Token-based plan (automata plan)
Tuple-based plan
Tuple stream
XML data stream
Query answer
Algebraic Stream Plan
28
Example Algebraic Plan
FOR $b in stream(biditems.xml)//book
LET $p := $b/price
    $t := $b/title
WHERE $p < 30
Return <Inexpensive> $t </Inexpensive>
Tuple-based plan
Token-based plan (automata plan)
29
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml)//book
LET $p := $b/price
    $t := $b/title
WHERE $p < 30
Return <Inexpensive> $t </Inexpensive>
StructuralJoin $b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Tuple-based plan
30
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml)//book
LET $p := $b/price
    $t := $b/title
WHERE $p < 30
Return <Inexpensive> $t </Inexpensive>
StructuralJoin $b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Select $p<30
Tagger “Inexpensive”, $t->$r
31
Plan Rewriting : In or Out?
Token-based plan (automata plan)
Tuple-based Plan
Tuple stream
XML data stream
Query answer
Pattern retrieval in Semantics-focused plan
Apply “push into automata”
32
Raindrop Plan Alternatives
Nav $b, /price->$p
ExtractNest $b, $p
ExtractNest $b, $t
SJoin //book
Select price < 30
Tagger
Nav $b, /title->$t
Nav $S1, //book->$b
ExtractNest $S1, $b
Navigate /price
Select price<30
Navigate book/title
Tagger
Nav $S1, //book->$b
NavUnnest $S1, //book ->$b
NavNest $b, /price ->$p
NavNest $b, /title ->$t
Select $p<30
Tagger “Inexpensive”, $t->$r
Out In
Statistics Collection and On-line Plan Migration
33
Raindrop : Research Contributions and Issues
- Costing/query optimization of plans
- On-the-fly migration into/out of automaton
- Physical implementation strategies of operators
- Exploit XML schema constraints for query optimization
- Load-shedding from an automaton
- Early memory release optimization
Published in CIKM’03, ER’03, DKE’06 Journal, VLDB’05, VLDB’06.
35
Fire Monitoring Queries
Ambient Queries: What are the typical temperature and humidity in given rooms, based on the environment?
Detection Queries: Are unusual behaviors or patterns detected?
Tracking Queries: Track smoke and heat clouds (moving clusters) in terms of their sizes and speeds.
Analysis Queries: Is there an outlier (prank), or an actual fire?
Reliability Assessment: Are any sensors faulty, and should they thus be ignored?
Prediction Queries: Match sensor readings of a fire against a fire stream simulation to determine similarity.
FireStream Demo to be presented at ICDE’07
36
Project: RFID Event Stream Monitoring
Given potentially infinite, heterogeneous, high-speed event streams
Goal: detect interesting patterns among events
- Supply chain management, e.g., (“insufficient inventory”→“no-backup”) or “inventory overflow”
- Business service optimization, e.g., “search ticket”→“timeout”
- Anomaly detection, e.g., “pick item”→“no checkout”→“exit”
- And more …
Complex query patterns to be answered in real-time
Supported by NEC Cupertino and NSF Princeton
37
Event Processing Example
Event stream (over time)
pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), …
Event Pattern Query
EVENT SEQ(PICK p, !(CHECKOUT c), EXIT e)
WHERE p.id=c.id AND c.id=e.id
WITHIN 12 hours
Processing
- Sequence scan & construction: (p, e) pairs
- Selection: apply predicates
- Window: check time constraints
- Negation: check for negation
- Transformation: make complex output event
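The pick/checkout/exit pattern can be sketched as a small hand-written matcher. This is an illustration under stated assumptions, not the engine described in the slides: it hard-codes this one pattern, and the function name `detect_unpaid_exits`, the `(timestamp, kind, item_id)` event format, and the hour timestamps are all hypothetical.

```python
def detect_unpaid_exits(events, within):
    """Find items that were picked and exited within `within` time units
    with no checkout in between. `events` are (timestamp, kind, item_id)
    triples ordered by timestamp."""
    picked = {}           # item_id -> timestamp of the open PICK
    checked_out = set()   # item ids checked out since their last PICK
    alerts = []
    for ts, kind, item_id in events:
        if kind == "pick":
            picked[item_id] = ts
            checked_out.discard(item_id)
        elif kind == "checkout":
            checked_out.add(item_id)
        elif kind == "exit" and item_id in picked:
            pick_ts = picked.pop(item_id)            # sequence: (p, e) pair
            in_window = ts - pick_ts <= within       # window check
            negated = item_id in checked_out         # negation check
            if in_window and not negated:
                alerts.append((item_id, pick_ts, ts))  # complex output event
    return alerts

# The stream from the slide, with assumed timestamps in hours.
stream = [(0, "pick", 1), (1, "pick", 2), (2, "pick", 3),
          (3, "checkout", 3), (4, "pick", 4), (5, "exit", 2)]
alerts = detect_unpaid_exits(stream, within=12)
```

The id equality predicates of the query are enforced implicitly by keying both lookups on `item_id`.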
38
Challenges for High-Performance Processing
Use “Workflows” to Early Terminate Pattern Queries
Optimize Event Pattern Queries Using Rewriting
Prefix Sharing of Multiple Event Pattern Queries
Scalable Processing Using Cluster
39
CAPE: Uncertainties in Stream Query Processing
[Figure labels:]
Register Continuous Queries
Scalable Stream Query Engine
Streaming Data (push-based paradigm)
Streaming Result
Real-time and accurate responses required
May have time-varying rates and high volumes
Available resources for executing each operator may vary over time.
Distribution and Adaptations are required.
High workload of queries
Memory and CPU resource limitations
(continuous evaluation)
40
CAPE : Continuous Adaptive Processing Engine -- Adaptation at all Layers
- Reactive Operator Algorithms
- Adaptive Scheduling of Operators
- On-Line Query Plan Reshaping
- Multi-Query Pipeline Sharing
- Synchronized Data Tree Spilling
- Adaptive Cluster-Driven Load Shedding
- Dynamic Workload Distribution over Cluster
- Data-Partitioning for Parallel Stream Processing
41
Adaptation Techniques in CAPE
On-Line Query Plan Reshaping
(with Yali Zhu and G. Heineman )
Published in ACM SIGMOD’ 2004, and in Submission to TODS journal
42
Run-time Plan Re-Optimization
Step1 - Decide when to optimize Statistics monitoring
Step2 – Generate new query plan Query optimization
Step3 – Replace current plan by new plan Plan Migration
43
Naïve Plan Migration Strategy
Migration Steps
- Pause execution of old plan
- Drain out all tuples inside old plan
- Replace old plan by new plan
- Resume execution of new plan
[Figure: old and new join plans over input streams A, B, and C, with joins AB and BC]
Problem: Works for stateless operators only
44
Stateful Operator in CQ
Why stateful?
- Need non-blocking operators in CQ
- Operator needs to output partial results
[Figure: join operator AB over inputs A and B, with State A and State B]
Key Observation: The purge of tuples in states relies on processing of new tuples.
Symmetric hash join: for each new tuple from A, purge state B, join with state B, and insert into state A
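The purge, join, insert cycle of a symmetric hash join can be sketched as below. The sketch assumes a time-based window and an equi-join on a single key; the class name, the `(timestamp, key, payload)` tuple layout, and the sample data are assumptions for illustration.

```python
from collections import defaultdict, deque

class SymmetricHashJoin:
    """Windowed equi-join of two streams, A and B, with one state per input."""

    def __init__(self, window):
        self.window = window
        # Arrival order per side (for purging) and a hash index by join key.
        self.order = {"A": deque(), "B": deque()}
        self.index = {"A": defaultdict(list), "B": defaultdict(list)}

    def _purge(self, side, now):
        order, index = self.order[side], self.index[side]
        while order and order[0][0] <= now - self.window:
            old = order.popleft()
            index[old[1]].remove(old)
            if not index[old[1]]:
                del index[old[1]]

    def insert(self, side, ts, key, payload):
        other = "B" if side == "A" else "A"
        self._purge(other, ts)                        # 1. purge opposite state
        new = (ts, key, payload)
        matches = [(new, m) if side == "A" else (m, new)
                   for m in self.index[other].get(key, [])]   # 2. probe it
        self.order[side].append(new)                  # 3. insert into own state
        self.index[side][key].append(new)
        return matches                                # results, always (A, B)

j = SymmetricHashJoin(window=10)
r1 = j.insert("A", 0, "k", "a0")
r2 = j.insert("B", 5, "k", "b5")
r3 = j.insert("B", 20, "k", "b20")    # purges a0 from state A
stale_b = len(j.order["B"])           # b5 is expired but still stored
r4 = j.insert("A", 21, "k", "a21")    # only now is b5 purged from state B
```

The last two calls show the key observation: the expired tuple `b5` stays in state B until a new tuple arrives on A and triggers the purge.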
45
Naïve Migration Strategy Revisited
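Steps
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan
[Figure: join plan over inputs A, B, and C with joins AB and BC, annotated: (2) all tuples drained, (3) old replaced by new, (4) processing resumed]
Deadlock Waiting Problem: states are purged only when new tuples are processed, so a paused plan can never finish draining.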
47
Moving State Strategy
Basic idea
- Share common states between two boxes
Key Steps
- State matching: identify common states
- State moving: share common states
- State recomputing: recompute unmatched states
48
Moving State Strategy
State Matching
- State in old box has unique ID
- During rewriting, new ID given to newly generated state in new box
- When rewriting done, match states based on IDs
State Moving
- Between matched states
- On same machine, creates new pointers for matched states in new box
What’s left?
- Unmatched states in new box
[Figure: Old Box and New Box, both reading input queues QA, QB, QC, QD and writing output queue QABCD. Old box joins: AB (states SA, SB), BC (states SAB, SC), CD (states SABC, SD). New box joins: BC (states SB, SC), CD (states SBC, SD), AB (states SA, SBCD).]
49
Unmatched States
State Recomputing
- Recursively recompute unmatched SBC and SBCD by joining matched states
Why always possible?
- Old and new boxes have same input queues
- The states associated with input queues always match
Why necessary?
[Figure: new box with joins BC (states SB, SC), CD (states SBC, SD), and AB (states SA, SBCD), reading input queues QA, QB, QC, QD and writing QABCD]
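State matching, moving, and recomputing can be sketched together in a few lines. The sketch makes several simplifying assumptions: a state ID is just the name of the streams it covers (such as "BC"), every join is an equi-join on one shared key `k`, and the helpers `join` and `migrate` as well as the sample tuples are hypothetical.

```python
def join(left, right):
    """Equi-join two states on the shared key "k"."""
    return [{**l, **r} for l in left for r in right if l["k"] == r["k"]]

def migrate(old_states, new_plan):
    """Build the states of the new box from those of the old box.
    `new_plan` lists (state_id, children) bottom-up; children is None
    for states attached to input queues."""
    new_states = {}
    for state_id, children in new_plan:
        if state_id in old_states:
            # State matching + moving: share the old state by reference.
            new_states[state_id] = old_states[state_id]
        else:
            # State recomputing: join the already available child states.
            left, right = children
            new_states[state_id] = join(new_states[left], new_states[right])
    return new_states

old_states = {
    "A": [{"k": 1, "A": "a1"}],
    "B": [{"k": 1, "B": "b1"}, {"k": 2, "B": "b2"}],
    "C": [{"k": 1, "C": "c1"}],
    "D": [{"k": 1, "D": "d1"}],
    "AB": [{"k": 1, "A": "a1", "B": "b1"}],
    "ABC": [{"k": 1, "A": "a1", "B": "b1", "C": "c1"}],
}
# New box: (B join C) join D, then join A; states SBC and SBCD are unmatched.
new_plan = [("A", None), ("B", None), ("C", None), ("D", None),
            ("BC", ("B", "C")), ("BCD", ("BC", "D"))]
new_states = migrate(old_states, new_plan)
```

Because input-queue states always match, the recursion in `migrate` always has matched states to start from, which mirrors the "why always possible" argument.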
50
MS Migration Pros and Cons
Pros
- Fast when # of tuples in states is small (low input rates or small window size)
Cons
- Output silence during entire migration stage
Can we output results even during migration?
Motivation for Parallel Track Strategy
51
Parallel Track Strategy
Basic idea
- Execute both old and new plans in parallel
- Gradually “push” old tuples out of old box by purging
Key Steps
- Connect new box
- Execute both boxes in parallel
- Remove old box once “expired”: it contains only new tuples, no old tuples or sub-tuples
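The expiry test for the old box can be sketched as a simple check over its states. This assumes each stored tuple carries the arrival timestamps of its sub-tuples; the function names, the state layout, and the sample values are hypothetical.

```python
def is_old(sub_timestamps, migration_start):
    """A (joined) tuple is old if any sub-tuple arrived before migration began."""
    return min(sub_timestamps) < migration_start

def old_box_expired(old_box_states, migration_start):
    """True once every tuple left in the old box is built from new sub-tuples only."""
    return all(not is_old(ts, migration_start)
               for state in old_box_states.values()
               for ts in state)

# Each stored tuple is represented only by the timestamps of its sub-tuples.
during = {"SA": [(12,)], "SAB": [(8, 11)]}    # (8, 11) contains an old sub-tuple
after = {"SA": [(12,)], "SAB": [(11, 12)]}    # old tuples have been purged
expired_during = old_box_expired(during, migration_start=10)
expired_after = old_box_expired(after, migration_start=10)
```

The check covers only the decision of when to disconnect the old box; how duplicate results from the two boxes are handled is outside this sketch.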
52
Parallel Track Strategy
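- Connect boxes
- Execute in parallel until all old tuples purged
- Disconnect old box
[Figure: old box (joins AB with states SA, SB; BC with states SAB, SC; CD with states SABC, SD) and new box (joins BC with states SB, SC; CD with states SBC, SD; AB with states SA, SBCD) running in parallel on input queues QA, QB, QC, QD and output queue QABCD; example: a tuple ABC in SABC]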
53
PT Migrations Pros and Cons
Pros
- Keeps producing results even during migration (MS migration produces no results)
Cons
- Migration duration is at least 2W
- MS may be faster, depending on the # of tuples in states
54
Summary: Stream Plan Migration
- First run-time solution for stateful operators
- Two migration methods: Moving State Strategy and Parallel Track Strategy
- Cost Models and Experimental Evaluations
What next?
- Scope of optimization?
- Support of other stateful operators?
- Migration in distributed stream systems?
55
Overall Summary : So Much Left to Do !
Large variety of challenging stream applications
Generic core technology for stream processing engines
Our central theme : Optimization via Adaptation
Part I: Plan migration
Part II: Plan distribution
Part III: Plan-level spill
Many open questions remain . . .
57
Acknowledgments
All the students (Ph.D., MS, and undergraduate) in the DSRG lab who have contributed to this research project directly or indirectly.
Most notably: Luping Ding, Yali Zhu, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Venky Raghavan, Song Wang, Natasha Bogdanova, Mingzhu Wei, Ming Li, and others.
To the National Science Foundation for partial support via IDM grants, to WPI for an RDC grant, and to IBM and NEC.
58
Selected CAPE Publications and Reports
[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.
[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.
[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration Paper. VLDB 2004.
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18, 2004.
[SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005.
[SLJR05] T. Sutherland, B. Liu, M. Jbantova and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005.
[LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005.
[B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report, WPI-CS-TR-05, 2005. (in submission)
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html