Efficient Processing of Massive Data Streams for Mining and Monitoring
![Page 1: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/1.jpg)
Mirek Riedewald
Department of Computer Science
Cornell University
Efficient Processing of Massive Data Streams for Mining and Monitoring
![Page 2: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/2.jpg)
Acknowledgements
Al Demers, Abhinandan Das, Alin Dobra, Sasha Evfimievski, Johannes Gehrke, KD-D initiative (Art Becker et al.)
![Page 3: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/3.jpg)
Introduction
Data streams versus databases: infinite stream, continuous queries, limited resources
Network monitoring: high arrival rates, approximation [CGJSS02]
Stock trading: complex computation [ZS02]
Retail, e-business, intelligence, medical surveillance: identify relevant information on the fly, archive for data mining; exact results, error guarantees
![Page 4: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/4.jpg)
Information Spheres
Local Information Sphere (within each organization): continuous processing of distributed data streams; online evaluation of thousands of triggers; storage/archival of important data
Global Information Sphere (between organizations): share data in a privacy-preserving way
![Page 5: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/5.jpg)
Local Information Sphere
Distributed data stream event processing and online data mining
Technical challenges: blocking operators and unbounded state; graceful degradation under increasing load; integration with archive; processing of physically distributed streams
![Page 6: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/6.jpg)
Event Matching, Correlation
Join of data streams

| Brand | Mpix | Price |
|-------|------|-------|
| Canon | 3.0  | 200   |

| Mpix | Price |
|------|-------|
| >2.0 | <250  |

![Page 7: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/7.jpg)
Event Matching, Correlation
Join of data streams

| Brand | Mpix | Price |
|-------|------|-------|
| Canon | 3.0  | 200   |
| Fuji  | 3.0  | 100   |

| Mpix | Price |
|------|-------|
| >2.0 | <250  |
| >4.0 | <400  |

![Page 8: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/8.jpg)
Event Matching, Correlation
Join of data streams
Equi-join, text similarity, geographical proximity,…
Problem: unbounded state, computation

| Brand | Mpix | Price |
|-------|------|-------|
| Canon | 3.0  | 180   |
| Fuji  | 3.0  | 220   |
| Kodak | 4.0  | 340   |

| Mpix  | Price |
|-------|-------|
| > 2.0 | < 250 |
| > 4.0 | < 400 |
| = 3.0 | < 200 |

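The matching step on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the second table holds subscriptions (one predicate per attribute) and the first holds arriving events; the names `OPS`, `subscriptions`, `matches`, and `match_event` are hypothetical and not from the talk.

```python
import operator

# Comparison operators used in the predicate table.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# Hypothetical encoding of the predicate table: one (operator, bound) per attribute.
subscriptions = [
    {"mpix": (">", 2.0), "price": ("<", 250)},
    {"mpix": (">", 4.0), "price": ("<", 400)},
    {"mpix": ("=", 3.0), "price": ("<", 200)},
]

# Events taken from the product table on the slide.
events = [
    {"brand": "Canon", "mpix": 3.0, "price": 180},
    {"brand": "Fuji", "mpix": 3.0, "price": 220},
    {"brand": "Kodak", "mpix": 4.0, "price": 340},
]

def matches(event, subscription):
    """True if the event satisfies every predicate of the subscription."""
    return all(OPS[op](event[attr], bound)
               for attr, (op, bound) in subscription.items())

def match_event(event, subs):
    """Return the indexes of all subscriptions the event satisfies."""
    return [i for i, sub in enumerate(subs) if matches(event, sub)]

for e in events:
    print(e["brand"], match_event(e, subscriptions))
```

Both tables grow without bound as events and subscriptions keep arriving, which is the unbounded-state problem named on the slide.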
![Page 9: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/9.jpg)
Window Joins
Restrict join to window of most recent records (tuples): landmark window; sliding window based on time or number of records
Problem definition: window based on time (size w); synchronous record arrival; equi-join
![Page 10: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/10.jpg)
Abstract Model
Data streams R(A,…), S(A,…)
Compute equi-join on A: match all r and s of streams R, S such that r.A = s.A
Sliding window of size w
R: 1 1 1
S: 2 3 1
Output: (r0,s2), (r1,s2), (r2,s2)
![Page 11: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/11.jpg)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
Compute equi-join on A: match all r and s of streams R, S such that r.A = s.A
Sliding window of size w
R: 1 1 1 3
S: 2 3 1 1
Output: (r0,s2), (r1,s2), (r2,s2); new: (r3,s1), (r1,s3), (r2,s3)
![Page 12: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/12.jpg)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
Compute equi-join on A: match all r and s of streams R, S such that r.A = s.A
Sliding window of size w
R: 1 1 1 3 2
S: 2 3 1 1 4
Output: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3); no new output
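The abstract model can be reproduced with a short Python sketch. It assumes one record per stream arrives at each time step, that a record at time i can still join an arrival at time t when t - i < w, and that w = 3 as in the later flow example; the function name `window_join` is hypothetical.

```python
def window_join(R, S, w):
    """Exact sliding-window equi-join with unlimited memory.

    Returns one list per time step holding the new result pairs (i, j),
    meaning r_i joined with s_j, produced when r_t and s_t arrive.
    """
    results = []
    for t in range(min(len(R), len(S))):
        new = []
        lo = max(0, t - w + 1)           # oldest time step still in the window
        for j in range(lo, t + 1):       # new r_t against S records, including s_t
            if R[t] == S[j]:
                new.append((t, j))
        for i in range(lo, t):           # new s_t against older R records
            if R[i] == S[t]:
                new.append((i, t))
        results.append(new)
    return results

steps = window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)
for t, pairs in enumerate(steps):
    print(t, pairs)
```

With these assumptions the three result pairs at step 2 and the three at step 3 agree with the slides, and step 4 produces nothing.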
![Page 13: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/13.jpg)
Limited Resources
Focus on limited memory: M < 2w
State of the art: random load shedding [KNV03] (random sample of streams)
Desired approach: semantic load shedding
Goal: graceful degradation
Approximation: set-valued result; error measure?
![Page 14: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/14.jpg)
Set-Approximation Error
What is a good error measure? Candidates from information retrieval, statistics, and data mining:
Matching coefficient: $|A \cap B|$
Dice coefficient: $2|A \cap B| / (|A| + |B|)$
Jaccard coefficient: $|A \cap B| / |A \cup B|$
Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$
Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
Earth Mover's Distance (EMD) [RTG98]
Match And Compare (MAC) [IP99]
Join: the approximate result is a subset of the exact output result
EMD and the overlap coefficient are trivially 0 or 1
Others (except MAC) reduce to the MAX-subset error measure
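A small Python sketch makes the subset argument concrete. It uses the standard set-based definitions of the five coefficients (an assumption, since the formulas on the slide did not survive extraction) and the six result pairs of the running example; `set_coefficients` is a hypothetical helper name, and both sets are assumed non-empty.

```python
import math

def set_coefficients(A, B):
    """Standard set-similarity coefficients for two non-empty sets."""
    inter = len(A & B)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(A) + len(B)),
        "jaccard": inter / len(A | B),
        "cosine": inter / math.sqrt(len(A) * len(B)),
        "overlap": inter / min(len(A), len(B)),
    }

# Exact join result of the running example and an approximation missing one pair.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
approx = exact - {(3, 1)}

for name, value in set_coefficients(approx, exact).items():
    print(f"{name}: {value:.3f}")
```

Because the approximate result is always a subset of the exact one, the overlap coefficient is 1 for any non-empty approximation, while the other four grow with the size of the approximate result. Maximizing any of them therefore amounts to maximizing the number of results produced.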
![Page 15: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/15.jpg)
Optimization Problem
Select records to be kept in memory such that the result size is maximized, subject to memory constraints
Lightweight online technique
Adaptivity in the presence of memory fluctuations
![Page 16: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/16.jpg)
Optimal Offline Algorithm
What is the best that can possibly be achieved?
Optimal sampling strategy for MAX-subset
Baseline for evaluation of any online algorithm
Same optimization problem, but knows the future
Finite subsets of input streams
Formulate as linear flow problem
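For very small inputs the offline optimum can also be found by exhaustive search, which helps to see what the optimization problem asks for. The sketch below is not the flow formulation from the talk. It assumes a single memory budget M shared by both streams, that the two records arriving at the same time step always join with each other, and that a record stored at time i can join an arrival at time t when t - i < w. The name `optimal_offline` is hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def optimal_offline(R, S, M, w):
    """Maximum number of join results with at most M stored records.

    Exhaustive search with memoization; exponential, for toy inputs only.
    """
    n = min(len(R), len(S))

    @lru_cache(maxsize=None)
    def best(t, memory):
        if t == n:
            return 0
        # Drop records that have left the window.
        live = tuple(e for e in memory if t - e[1] < w)
        # Arrivals r_t and s_t join with stored records of the other stream.
        gained = sum(1 for stream, i in live
                     if (stream == "R" and R[i] == S[t])
                     or (stream == "S" and S[i] == R[t]))
        gained += 1 if R[t] == S[t] else 0   # assumption: simultaneous arrivals join
        candidates = live + (("R", t), ("S", t))
        k = min(M, len(candidates))
        # Keeping more records never hurts, so only subsets of size k are tried.
        return gained + max(best(t + 1, keep)
                            for keep in combinations(candidates, k))

    return best(0, ())

print(optimal_offline([1, 1, 1, 3], [2, 3, 1, 1], M=2, w=3))
```

The search knows the whole input in advance, which is what separates the offline optimum from any online strategy. The flow formulation reaches the same kind of optimum in polynomial time.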
![Page 17: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/17.jpg)
Generation of Flow Model
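R=1,1,1,3
S=2,3,1,1
M=2, w=3
Fixed memory allocation
[Figure: flow network for the example. Arcs have capacity 0..1 and linear cost; labeled arc costs are -1; arc types are "Keep in memory" and "Replace"; further labels in the figure: 3, -3.]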
![Page 18: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/18.jpg)
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
![Page 19: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/19.jpg)
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
![Page 20: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/20.jpg)
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
[Figure: three arcs labeled with cost -1]
![Page 21: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/21.jpg)
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
[Figure: six arcs labeled with cost -1]
![Page 22: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/22.jpg)
Complexity
Integer solution exists
Optimal solution found in $O(n^2 m \log n)$, where N is the input size of a single stream
Number of nodes: n < 2wN + N + 2
Number of arcs: m < 2n + M + 1
Reasonable costs for benchmarking: approx. 1 GB memory (w=800, M=800), approx. 1 h computation time
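The size bounds are easy to evaluate for a given input. The helper below simply plugs numbers into the two inequalities from this slide; the name `flow_graph_bounds` is hypothetical, and the arc bound is obtained by substituting the node bound for n.

```python
def flow_graph_bounds(N, w, M):
    """Upper bounds on nodes and arcs of the flow graph, per the slide's inequalities."""
    max_nodes = 2 * w * N + N + 2          # n < 2wN + N + 2
    max_arcs = 2 * max_nodes + M + 1       # m < 2n + M + 1, with n replaced by its bound
    return max_nodes, max_arcs

# Toy example from the flow slides: four records per stream, w=3, M=2.
print(flow_graph_bounds(N=4, w=3, M=2))
```

For fixed w the graph grows linearly in N, so the stated running time is polynomial in the stream length.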
![Page 23: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/23.jpg)
Optimal Flow
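R=1,1,1,3
S=2,3,1,1
M=2, w=3
Fixed memory allocation
[Figure: optimal flow in the network for the example. Arcs have capacity 0..1 and linear cost; labeled arc costs are -1; arc types are "Keep in memory" and "Replace"; further labels in the figure: 3, -3.]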
![Page 24: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/24.jpg)
Easy to Extend
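R=1,1,1,3
S=2,3,1,1
M=2, w=3
Variable memory allocation
[Figure: flow network for the example. Arcs have capacity 0..1 and linear cost; labeled arc costs are -1; arc types are "Keep in memory" and "Replace"; further labels in the figure: 3, -3.]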
![Page 25: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/25.jpg)
Online Heuristics
Maximize expected output
PROB: sort tuples by join partner arrival probability
LIFE: sort tuples by product of partner arrival probability and remaining lifetime
Maintain stream statistics: histograms [DGIM02, TGIK02], wavelets [GKMS01], quantiles [GKMS02, GK01]
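The two heuristics can be sketched as eviction policies over a shared memory of M records. This is a minimal illustration under several assumptions that are not from the talk: the partner arrival probabilities are supplied as plain dictionaries (`prob_R`, `prob_S`) instead of being maintained from histograms or other synopses, simultaneous arrivals always join, memory is shared between both streams, and ties keep arrival order. The name `run_heuristic` is hypothetical.

```python
def run_heuristic(R, S, M, w, prob_R, prob_S, heuristic="PROB"):
    """Count join results when memory holds only the M highest-priority records."""
    memory = []    # entries: (stream, arrival_time, value)
    output = 0
    for t, (r, s) in enumerate(zip(R, S)):
        # Expire records that have left the window.
        memory = [e for e in memory if t - e[1] < w]
        # Arrivals join with stored records of the other stream.
        output += sum(1 for stream, _, v in memory
                      if (stream == "R" and v == s) or (stream == "S" and v == r))
        output += 1 if r == s else 0       # assumption: simultaneous arrivals join
        memory += [("R", t, r), ("S", t, s)]

        def priority(entry):
            stream, arrived, v = entry
            # Probability that a join partner arrives on the other stream.
            p = prob_S.get(v, 0.0) if stream == "R" else prob_R.get(v, 0.0)
            if heuristic == "LIFE":
                return p * (w - 1 - (t - arrived))   # times remaining lifetime
            return p

        memory.sort(key=priority, reverse=True)
        del memory[M:]                     # evict the lowest-priority records
    return output

R, S = [1, 1, 1, 3], [2, 3, 1, 1]
prob_R = {1: 0.75, 3: 0.25}                # empirical frequencies of the toy streams
prob_S = {1: 0.5, 2: 0.25, 3: 0.25}
for h in ("PROB", "LIFE"):
    print(h, run_heuristic(R, S, M=2, w=3, prob_R=prob_R, prob_S=prob_S, heuristic=h))
```

On this toy input LIFE retains a record that still has time left in the window where PROB keeps one about to expire, which shows why remaining lifetime is a useful second factor.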
![Page 26: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/26.jpg)
Approximation Quality
![Page 27: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/27.jpg)
Effect of Skew
![Page 28: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/28.jpg)
Summary
Information sphere architecture
Optimal algorithm and fast, efficient heuristic for sliding window joins
Open problems: other set error measures and resource models; other joins (compress records); complex queries; distributed processing; integration with other techniques into the local information sphere
![Page 29: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/29.jpg)
Related Work
Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
Memory requirements [ABBMW02, TM02]
Aggregation: Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
![Page 30: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/30.jpg)
Other Results
[DGR03] Integration with archive: load smoothing, not shedding; novel "error" measure: archive access cost
Static join for sensor networks: maximize result size subject to constraints on energy consumption; polynomial dynamic programming solution; fast 2-approximation algorithms; NP-hardness proof for join of 3 or more streams
![Page 31: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/31.jpg)
Other Results (cont.)
[DGGR02] Computation of aggregates over streams for multiple joins: small pseudo-random sketch synopses (randomized linear projections); explicit, tunable error guarantees; sketch partitioning to boost accuracy (intelligently partition join attribute space)
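The idea of a randomized linear projection can be illustrated with a basic sign-sketch estimator for the size of a two-stream equi-join. This is a generic sketch of the technique, not the algorithm of [DGGR02]: it uses fully random signs stored in a table where a real synopsis would use a compact limited-independence hash family, it averages independent copies with no further error control, and it omits sketch partitioning. The names `make_signs` and `estimate_join_size` are hypothetical.

```python
import random

def make_signs(seed):
    """Return a function mapping each attribute value to a fixed random +1 or -1."""
    rng = random.Random(seed)
    table = {}   # illustration only: a real sketch avoids storing per-value state
    def sign(v):
        if v not in table:
            table[v] = rng.choice((-1, 1))
        return table[v]
    return sign

def estimate_join_size(R, S, copies=64, seed=0):
    """Estimate the number of pairs (r, s) with r == s from per-stream sketches."""
    estimates = []
    for c in range(copies):
        sign = make_signs(seed + c)
        x_r = sum(sign(v) for v in R)   # one counter per stream, updated per record
        x_s = sum(sign(v) for v in S)   # same sign function for both streams
        estimates.append(x_r * x_s)     # expected value equals the join size
    return sum(estimates) / copies

R = [1, 1, 1, 3, 2]
S = [2, 3, 1, 1, 4]
true_size = sum(1 for r in R for s in S if r == s)
print(true_size, estimate_join_size(R, S))
```

Each stream is summarized by a single counter per copy, and the counter is a linear function of the stream's frequency vector, so it can be maintained one record at a time. Averaging more copies tightens the estimate.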
![Page 32: Efficient Processing of Massive Data Streams for Mining and Monitoring](https://reader035.vdocuments.mx/reader035/viewer/2022062518/568149b6550346895db6ee08/html5/thumbnails/32.jpg)
Thanks!
Questions?