27-Apr-17
Riding the Big IoT Data Wave: Complex Analytics for IoT Data Series
Themis Palpanas Paris Descartes University
Telecom Paristech
Paris, April 2017
References
Themis Palpanas - Telecom Paristech, Apr 2017
• papers
▫ ADS: The Adaptive Data Series Index. VLDBJ 2016
http://www.mi.parisdescartes.fr/~themisp/publications/vldbj16-ads.pdf
▫ Big Sequence Management: A Glimpse on the Past, the Present, and the Future. LNCS, 2016
http://www.mi.parisdescartes.fr/~themisp/publications/sofsem16-bisem.pdf
▫ Query Workloads for Data-Series Indexes. KDD 2015
http://www.mi.parisdescartes.fr/~themisp/publications/kdd15-bends.pdf
▫ RINSE: Interactive Data Series Exploration. VLDB 2015
http://www.mi.parisdescartes.fr/~themisp/publications/vldb15-rinse.pdf
▫ Indexing for Interactive Exploration of Big Data Series. SIGMOD 2014
http://www.mi.parisdescartes.fr/~themisp/publications/sigmod14-ads.pdf
▫ Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS 2014
http://www.mi.parisdescartes.fr/~themisp/publications/kais14-isax2plus.pdf
▫ iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010
http://www.mi.parisdescartes.fr/~themisp/publications/icdm10-billiontimeseries.pdf
• code and datasets
▫ http://www.mi.parisdescartes.fr/~themisp/isax2plus/
• data series toolbox
▫ https://github.com/zoumpatianos/DSStat
• demo
▫ http://daslab.seas.harvard.edu/rinse/
Acknowledgements
• Michele Linardi, Anna Gogolou, Botao Peng, Karima Echihabi (Paris Descartes University)
• Alessandro Camerra (University of Trento)
• Stratos Idreos, Kostas Zoumpatianos (Harvard University)
• Yin Lou, Johannes Gehrke (Cornell University)
• Jin Shieh, Eamonn Keogh (University of California at Riverside)
Executive Summary
• data are collected at unprecedented rates
• they enable data-driven scientific discovery
• many of these data are sequences
▫ analyzing big sequence collections takes days to weeks
• our work: analyze big sequences in minutes/seconds
Data series
• Sequence of points ordered along some dimension, e.g., time or position
[Figure: a data series with values v1, v2, …, vn at points x1, x2, …, xn along the sequence dimension]
GTCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTGGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATTCAAGGGAAAGCCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTTCTTGGCAGAAGATATTACAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACAGCTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCATCCGCCGCCAGAGGATCTGTAAAAGAGAGGTTGTTACGAAACTGGCAACTGCCAACCAAAGTCCACCAATGGACAAGCAAAAAAGAGCACTCATCTCATGCTCCCAAGGATCAACCTTCCCAGAGTTTTCACTTAAGTGGCCACCAAGCCAGTTGTCAATCCAGGGCTTTGGACTGAAATCTAGGGCTTCATCCGCTACCTCAGAGTGTCTTCTATTTCTTCCAGCCAGTGACAAATACAACAAACATCTGAGATGTTTTAGCTATAAATCCTTTACAATTGTTATTTATGTCTTAACTTTTGTTATACCTGGAAAAGTAGGGGAAACAATAAGAACATACTGTCTTGGCCAAGCATCCAAGGTTAAATGAGTTATGGAAATTCATTTGGGAGCCAAGACATTGCACGTGGTTATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAGCAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGGACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTATTTAAAGAGATTTGGGCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTAGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATCAAGGGAAAGTCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTCTTGGCAGAAGATATTGCAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACACTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCACCCGCTGCCAGATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAACAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGAACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTTTTAAACAGATTTGA
Schinnerer et al.
Telecommunications
• analysis of call activity patterns
▫ Telecom Italia
[Figure: cluster map of incoming-call time series; average number of calls over time for the 5 smallest clusters; call activity for Easter Monday]
Home Networks
• temporal usage behavior analysis of home networks
▫ Portugal Telecom
[Figure: clustering based on user activity patterns over time, highlighting a (previously unknown) frequent behavior pattern]
Operation Health Monitoring
Road Tunnel Monitoring and Control
Lamp levels are typically determined statically, ignoring environmental conditions
Overprovisioned to meet the regulations
Problems: wasted energy and a potential security hazard
Idea: place wireless sensors along the tunnel and adjust lamps to actual conditions
Eliminate overprovisioning, account for environmental variations
stop distance
Tunnel length of 630 m,
2 separate 2-lane carriageways,
~28,000 vehicles/day, 90 WSN nodes
Full, operational system
Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. Practical Data Prediction for Real-World Wireless Sensor Networks. TKDE 27(8), 2015
Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks? PerCom 2012 (Best Paper)
Data-Driven Data Acquisition (DDDA)
Typical WSN system: the sink gathers all sensor readings of the WSN. Advantage: precise
DDDA/WSN system: the sink predicts the sensor readings of the WSN. Advantage: less traffic
What Does Data-Driven Data Acquisition Really Achieve?
DDDA is well studied in the database community
Depending on the scenario, 99% less traffic is generated
Does 99% less traffic mean 99% more lifetime?
Current studies look only at the application layer
The full network stack influences lifetime
[Figure: network stack (Hardware, MAC, Routing, Application), shown without and with DDDA at the application layer]
Derivative-Based Prediction (DBP)
[Figure: sensor value over time; a learning phase with edge points at both ends and slope δ, followed by a prediction phase]
DBP: a linear model
Easy to calculate the model
Easy to decide if the sensed data fit the model
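A minimal sketch of how such a linear model could be computed and checked. It assumes (the slide does not spell this out) that the slope δ is taken between the averages of the first and last few edge points of the learning window. The names `fit_dbp`, `predict`, `fits`, and the parameters `edge` and `tol` are hypothetical.

```python
def fit_dbp(window, edge=3):
    """Fit a line to a learning window from its edge points (assumed scheme)."""
    if len(window) < 2 * edge:
        raise ValueError("learning window too short for the chosen edge size")
    first = sum(window[:edge]) / edge    # average of the leading edge points
    last = sum(window[-edge:]) / edge    # average of the trailing edge points
    # The centers of the two edge groups are (len(window) - edge) samples apart.
    delta = (last - first) / (len(window) - edge)
    return last, delta, edge


def predict(model, k):
    """Predicted value k sampling periods after the end of the learning window."""
    last, delta, edge = model
    # 'last' sits at the center of the trailing edge group, (edge - 1) / 2
    # samples before the end of the window.
    return last + delta * ((edge - 1) / 2 + k)


def fits(model, k, sensed, tol):
    """True if the sensed value is within tol of the model's prediction."""
    return abs(sensed - predict(model, k)) <= tol
```

Usage: `model = fit_dbp(readings)` on the learning window, then `fits(model, k, new_reading, tol)` for each new sample; only when the check fails would a new model need to be sent.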
Publications
TKDE’15
PERCOM’12
SPRINGER’12
Tolerance for Prediction Error
The centralized control algorithm requires only approximate values
Tolerances: sensed values may temporarily deviate from the predicted values without requiring new model generation
Value tolerance: the maximum of a percentage and an absolute value
Addresses inherent sensor error and variations at low light levels
Time tolerance: expressed in sampling periods
Lamps are adjusted gradually
[Figure: sensor value over time, annotated with the value tolerance and the time tolerance]
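The two tolerances can be sketched as follows. This is an illustration under stated assumptions, not the deployed logic: the percentage is assumed to be taken relative to the predicted value, and a new model is assumed to be needed once more than `time_tol` consecutive samples fall outside the value tolerance. All names are hypothetical.

```python
def value_tolerance(predicted, rel, absolute):
    """Value tolerance: the larger of a relative and an absolute bound."""
    return max(rel * abs(predicted), absolute)


def first_model_violation(predicted, sensed, rel, absolute, time_tol):
    """Index of the sample at which a new model is needed, or None.

    A sample is out of tolerance when it deviates from the prediction by more
    than the value tolerance; a new model is needed only after more than
    time_tol consecutive out-of-tolerance samples (assumed semantics).
    """
    streak = 0
    for i, (p, s) in enumerate(zip(predicted, sensed)):
        if abs(s - p) > value_tolerance(p, rel, absolute):
            streak += 1
            if streak > time_tol:
                return i
        else:
            streak = 0  # a sample back within tolerance resets the count
    return None
```

The absolute bound dominates at low light levels, where a pure percentage would be too strict; the time tolerance absorbs short spikes.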
Testing the DBP Model
Test scenario
250 m long tunnel, 40 TMote-equivalent nodes, 1 gateway
47 days, samples every 30 seconds, 5.4 million measurements
DBP: parameters established by lighting engineers
[Figure: deployment with 40 WSN nodes, sink/gateway, and lamps]
DBP Results: Excellent reduction in data traffic
99.1% messages suppressed at the nodes
[Figure: model corrections in 5-minute intervals]
Fewer than 10 model transmissions every 5 minutes [instead of 400 samples]
Most models are generated during daylight hours
Few corrections are required at night
Comparison to other models and with alternate parameters in the paper
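The figures quoted for the test scenario can be cross-checked with a few lines of arithmetic. The sketch assumes every node samples without gaps for the full 47 days; the number of messages actually sent is only implied by the 99.1% suppression rate, not stated on the slides.

```python
NODES = 40              # WSN nodes in the test tunnel
SAMPLE_PERIOD_S = 30    # one sample every 30 seconds
DAYS = 47               # duration of the experiment

# Samples produced by all nodes in one 5-minute interval.
samples_per_5min = NODES * (5 * 60 // SAMPLE_PERIOD_S)

# Samples produced by all nodes over the whole experiment.
total_samples = NODES * (DAYS * 24 * 3600 // SAMPLE_PERIOD_S)

# Messages left after suppressing 99.1% of them (implied, not reported).
sent_messages = round(total_samples * (1 - 0.991))
```

Here `samples_per_5min` evaluates to 400, matching the "[instead of 400 samples]" note, and `total_samples` to 5,414,400, matching the quoted 5.4 million measurements.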
DBP Results: Excellent data approximation
Infrequent models yield data very close to the actual readings
Deviations from tolerance correspond to model changes
Value tolerance magnitude calculated based on actual light values
Massive Data Series Collections
Human Genome project: 130 TB
NASA’s Solar Observatory: 1.5 TB per day
Planned Large Synoptic Survey Telescope: ~30 TB per night
data center and services monitoring: 2B data series, 4M points/sec
What do we want to do with them?
- simple query answering
select some data series
select values in a time interval
select values in some range
combinations of those
What do we want to do with them?
- complex analytics
Similarity Search
Classification
Clustering
Outlier Detection
Frequent Pattern Mining
[Figure: similarity search; given a query, find the similar sequences in a sequence collection]
Euclidean
\[ D(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
Dynamic Time Warping (DTW)
\[ D_{dtw}(X,Y) = f(n,m), \qquad f(i,j) = |x_i - y_j| + \min \{ f(i,j-1),\ f(i-1,j),\ f(i-1,j-1) \} \]
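Both distances can be sketched in a few lines. The DTW recurrence needs boundary conditions that the slide does not show; the sketch assumes the usual ones, f(0,0) = 0 and f(i,0) = f(0,j) = ∞, and uses the absolute difference as the per-point cost.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two series of equal length n."""
    if len(x) != len(y):
        raise ValueError("Euclidean distance needs series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x, y):
    """Dynamic Time Warping distance D_dtw(X, Y) = f(n, m)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # f[i][j] is the cost of the best warping path aligning x[:i] with y[:j].
    f = [[inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0  # assumed boundary condition
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])
    return f[n][m]
```

Unlike the Euclidean distance, DTW accepts series of different lengths and tolerates local stretching: `dtw([1, 2, 3], [1, 2, 2, 3])` is 0, while the Euclidean distance is not even defined for that pair. The quadratic table is the price for that flexibility.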
HARD, because of very high dimensionality:
each data series has 100s-1000s of points!
even HARDER, because of very large size: millions to billions of data series (multi-TBs)!
Query answering process
[Figure: pipeline in two stages. Data Loading Procedure: raw data are loaded into a data series database/index (data-to-query time). Query Answering Procedure: queries run against the database/index and return answers (query answering time).]
these times are big!
Similarity Search via Serial Scan
Similarity Search via Indexing
Traditional Approaches
answer nearest neighbor queries on a 1TB dataset:
Query Answering
serial scan takes 45 minutes/query
a data series index can reduce querying time
but building the index takes too long!
indexing a 1TB dataset takes days
complex analytics in hours/days…
Query answering process
we have proposed the state-of-the-art solutions for both problems (data-to-query time and query answering time)!
Our Approach: ADS+
complex analytics in minutes/seconds!
Outline
• background
▫ SAX representation
▫ iSAX representation
▫ iSAX index
• proposed solution
▫ bulk loading
▫ splitting policy
▫ adaptive solution
• experimental evaluation, case studies
• conclusions, future work, new challenges
SAX Representation
• Symbolic Aggregate approXimation (SAX)
▫ (1) Represent data series T of length n with w segments using Piecewise Aggregate Approximation (PAA)
T typically normalized to μ = 0, σ = 1
\[ PAA(T,w) = \bar{T} = \bar{t}_1, \ldots, \bar{t}_w \quad \text{where} \quad \bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j \]
▫ (2) Discretize into a vector of symbols
Breakpoints map to a small alphabet a of symbols
[Figure: a data series T of length 16, its PAA(T,4), and its iSAX(T,4,4) with symbols 00, 01, 10, 11]
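The two SAX steps can be sketched as follows. The sketch assumes that w divides n, that the breakpoints are the equiprobable quantiles of a standard normal distribution, and that symbol 0 denotes the lowest region; the function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def znorm(t):
    """Normalize a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(t), pstdev(t)
    return [(v - mu) / sigma for v in t]


def paa(t, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(t)
    if n % w:
        raise ValueError("this sketch assumes that w divides n")
    seg = n // w
    return [sum(t[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def sax(t, w, a):
    """SAX word of t: w symbols drawn from an alphabet of size a."""
    # a - 1 breakpoints that cut the standard normal into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    # A symbol is the number of breakpoints lying below the segment mean.
    return [sum(v > b for b in breakpoints) for v in paa(t, w)]
```

For example, `sax(znorm([1, 1, 3, 3, 5, 5, 7, 7]), 4, 4)` yields `[0, 1, 2, 3]`: one symbol per segment, rising with the series.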
iSAX Representation
• iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity
iSAX(T,4,8) = { 6, 6, 3, 0} = {110, 110, 011, 000}
iSAX(T,4,4) = { 3, 3, 1, 0} = {11, 11, 01, 00}
iSAX(T,4,2) = { 1, 1, 0, 0} = {1, 1, 0, 0}
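The multi-resolution property comes from the bit strings: a lower-cardinality symbol is a prefix of the higher-cardinality one. A small sketch, assuming symbols are stored as integers whose binary form is the iSAX bit string (the function names are hypothetical):

```python
def reduce_cardinality(word, from_bits, to_bits):
    """Drop the low-order bits of each symbol to get a coarser iSAX word."""
    if to_bits > from_bits:
        raise ValueError("cannot add resolution that was never stored")
    return [s >> (from_bits - to_bits) for s in word]


def as_bits(word, bits):
    """Render each symbol as a fixed-width bit string."""
    return [format(s, f"0{bits}b") for s in word]
```

Starting from the 3-bit word `[6, 6, 3, 0]`, reducing to 2 bits gives `[3, 3, 1, 0]` and to 1 bit gives `[1, 1, 0, 0]`, reproducing the three rows above.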
iSAX Index
• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate
▫ base cardinality b (optional), segments w, threshold th
▫ hierarchically subdivides SAX space until num. entries ≤ th
• Approximate Search
▫ Match iSAX representation at each level
• Exact Search
▫ Leverage approximate search
▫ Prune search space using a lower bounding distance
[Figure: example with th=4, w=4, b=1; inserting 1 1 1 0 into a full leaf splits it by adding one bit to one segment (1 1 10 0 and 1 1 11 0)]
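A toy, in-memory sketch of the structure described above: leaves hold at most `th` entries, and an overflowing leaf is split by refining one segment with one more bit. The round-robin choice of the segment to refine is a hypothetical stand-in for the actual splitting policy, `MAX_BITS` is an assumed full resolution, and only approximate search is shown (exact search would add lower-bound pruning).

```python
MAX_BITS = 3  # assumed full resolution of each stored symbol


class Node:
    def __init__(self, bits, th):
        self.bits = list(bits)  # bits used per segment at this node
        self.th = th            # leaf capacity threshold
        self.entries = []       # full-resolution words (leaves only)
        self.seg = None         # segment refined by this node's split (None for a leaf)
        self.kids = {}

    def _next_bit(self, word):
        # The first bit of the split segment that this node does not yet use.
        shift = MAX_BITS - self.bits[self.seg] - 1
        return (word[self.seg] >> shift) & 1

    def insert(self, word):
        if self.seg is not None:  # internal node: route to the matching child
            self.kids[self._next_bit(word)].insert(word)
            return
        self.entries.append(word)
        splittable = [i for i, b in enumerate(self.bits) if b < MAX_BITS]
        if len(self.entries) <= self.th or not splittable:
            return
        # Overflow: refine the segment with the fewest bits (hypothetical policy).
        self.seg = min(splittable, key=lambda i: self.bits[i])
        child_bits = list(self.bits)
        child_bits[self.seg] += 1
        self.kids = {0: Node(child_bits, self.th), 1: Node(child_bits, self.th)}
        pending, self.entries = self.entries, []
        for w in pending:
            self.kids[self._next_bit(w)].insert(w)

    def find_leaf(self, word):
        node = self
        while node.seg is not None:
            node = node.kids[node._next_bit(word)]
        return node


class ISaxIndex:
    def __init__(self, w, th):
        self.w, self.th, self.root = w, th, {}

    def _key(self, word):
        # First-level nodes use base cardinality b = 1 bit per segment.
        return tuple(s >> (MAX_BITS - 1) for s in word)

    def insert(self, word):
        key = self._key(word)
        if key not in self.root:
            self.root[key] = Node([1] * self.w, self.th)
        self.root[key].insert(word)

    def approximate_search(self, word):
        """Entries of the leaf whose iSAX region matches the query word."""
        node = self.root.get(self._key(word))
        return node.find_leaf(word).entries if node else []
```

Because regions never overlap, each word follows exactly one root-to-leaf path, which is what makes the approximate search a simple descent.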
iSAX Index: Shortcomings & Challenges
• this is a wonderful index!
• … but why does it take soooo long to build for huge datasets?
• because iSAX implements
▫ a naive node splitting policy
may lead to ineffective splits and additional disk I/O
▫ no bulk loading strategy
does not use available main memory to reduce disk I/O
iSAX 2.0
Bulk Loading Algorithm
• design principles:
▫ take advantage of available main memory
▫ maximize sequential disk accesses
• intuition for proposed solution:
▫ for each leaf node, collect as many data series that belong to it as possible before materializing the leaf node
the raw values of data series in leaf nodes are written to disk
• iterate between two phases (till all data series are indexed):
▫ Phase 1
read data series and group them according to first-level nodes
use all available main memory
▫ Phase 2
grow index by processing the subtree rooted at each one of the first-level nodes one at-a-time
flush leaf node contents to disk using sequential accesses
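The two phases can be sketched as a toy simulation. Everything here is illustrative: `first_level_key` and `leaf_key` stand in for the iSAX routing, `memory_budget` counts buffered series rather than bytes, and `write_leaf` stands in for a sequential disk write.

```python
from collections import defaultdict


def bulk_load(series_stream, first_level_key, leaf_key, memory_budget, write_leaf):
    """Alternate between buffering (Phase 1) and flushing (Phase 2)."""
    fbl = defaultdict(list)  # one buffer per first-level node
    buffered = 0

    def flush():
        nonlocal buffered
        # Phase 2: process the subtree under each first-level node, one at a time.
        for root_key in sorted(fbl):
            lbl = defaultdict(list)  # one buffer per leaf of this subtree
            for s in fbl[root_key]:
                lbl[leaf_key(s)].append(s)
            for leaf in sorted(lbl):
                write_leaf(leaf, lbl[leaf])  # one contiguous write per leaf
        fbl.clear()
        buffered = 0

    # Phase 1: read series and group them by first-level node until memory is full.
    for s in series_stream:
        fbl[first_level_key(s)].append(s)
        buffered += 1
        if buffered >= memory_budget:
            flush()
    if buffered:
        flush()
```

Grouping by first-level node before touching any leaf is what turns many small random writes into a few larger sequential ones.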
Publications
ICDM‘10
[Figure: animation of bulk loading over root R, internal node I1, and leaves L1 to L4, split between main memory and disk. New data series are inserted into First Buffer Layer (FBL) buffers: no limit in the size of FBLs! Data then move to Leaf Buffer Layer (LBL) buffers: LBLs have same size as leaf nodes! no extra memory needed! Leaves are flushed to disk: mainly sequential writes!]
Experimental Evaluation
Bulk Loading
[Figure: time to build (hours, 0 to 500) vs. number of data series indexed (100M to 1B) for iSAX-BufferTree, iSAX, and iSAX 2.0]
• 1 Billion data series indexed in 16 days: 72% less time
• indexing time per data series: 0.001 sec
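The per-series figure follows directly from the total build time, as the short check below shows. The baseline duration is only implied by the 72% saving and is not stated on the slide; the function names are hypothetical.

```python
def seconds_per_series(days, n_series):
    """Average indexing time per data series, in seconds."""
    return days * 24 * 3600 / n_series


def implied_baseline_days(days, fraction_saved):
    """Build time of the baseline implied by a relative saving."""
    return days / (1 - fraction_saved)


per_series = seconds_per_series(16, 1_000_000_000)  # about 0.0014 s
baseline = implied_baseline_days(16, 0.72)          # about 57 days
```

So 16 days for one billion series is roughly 1.4 ms per series, which the slide rounds to 0.001 sec.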
iSAX2+
• intuition for proposed solution:
▫ iSAX grows fast at the beginning of bulk loading, its shape stabilizing well before the end of the process
▫ several data series end up in leaf nodes that never need to split
▫ implement lazy splitting:
move raw data to leaf node the first time
if leaf node splits, do not move raw data until the end of index building process
Publications
KAIS‘14
Experimental Evaluation
Bulk Loading
• 1 Billion data series indexed in 10 days: 82% less time than iSAX
• indexing time per data series: 0.8 milliseconds
[Figure: time to build (hours, 0 to 450) vs. number of time series indexed (100M to 1B) for iSAX 2.0, iSAX2+, and iSAX 2.0 Clustered]
Drawback of iSAX2+
• cannot start answering queries until entire index is built!
Adaptive Data Series Index:
ADS+
• novel paradigm for building a data series index
▫ do not build entire index and then answer queries
▫ start answering queries by building the part of the index needed by those queries
• still guarantee correct answers
• intuition for proposed solution
▫ build the iSAX index using the iSAX representations, just like iSAX2+
▫ but start with a large leaf size, to minimize initial cost
▫ postpone leaf materialization to query time
▫ only materialize (at query time) leaves needed by queries
▫ parts that are queried more are refined more
▫ use smaller leaf sizes (reduced leaf materialization and query answering costs)
Publications
SIGMOD‘14
VLDBJ‘16
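A much simplified sketch of the core idea, postponing leaf materialization to query time. It is not the ADS+ algorithm: there is no tree and no adaptive splitting, the raw file is modeled as a Python list, and only an approximate answer from the matching leaf is returned. All names are hypothetical.

```python
class Leaf:
    def __init__(self):
        self.positions = []  # pointers into the raw file, kept at build time
        self.raw = None      # None marks a PARTIAL leaf; a list marks a FULL one


def build(raw_file, summarize):
    """Index only the summaries; raw data stay in the raw file."""
    leaves = {}
    for pos, series in enumerate(raw_file):
        leaves.setdefault(summarize(series), Leaf()).positions.append(pos)
    return leaves


def query(leaves, raw_file, q, summarize, dist):
    """Approximate answer: the best match inside the leaf that q maps to."""
    leaf = leaves.get(summarize(q))
    if leaf is None:
        return None
    if leaf.raw is None:
        # Materialize this leaf now, and only this leaf.
        leaf.raw = [raw_file[pos] for pos in leaf.positions]
    return min(leaf.raw, key=lambda s: dist(q, s))
```

Leaves that no query ever touches stay PARTIAL and never pay the cost of loading raw data, which is where the saving in data-to-query time comes from.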
[Figure: animation of ADS+ index construction and querying, with the index (ROOT, internal nodes, leaves, FBL, LBL) in RAM and the raw data on DISK. Start building an index with only the iSAX representations: read the data series one by one from the raw file, convert them to iSAX, store only iSAX in memory (64 times smaller, ~1%), discard the raw data and keep a pointer to the raw file. Continue loading data until we run out of memory, then expand each sub-tree, move data to the LBL, and flush the data to the disk to free up memory. Leaves are initially PARTIAL. When Query #1 reaches a leaf that is TOO BIG, an adaptive split creates a smaller leaf; the data are loaded in the LBL to answer the query, and the leaf becomes FULL. We spill to the disk when we run out of memory.]
Experimental Evaluation
Themis Palpanas - Telecom Paristech, Apr 2017
200
• iSAX 2.0 needs more than 35 hours to answer 100K queries
• ADS+ answers 100K queries in less than 5 hours
7x faster
Comparison to multi-dimensional indices
1-3 orders of magnitude faster than multi-dimensional indexing methods
measure data-to-query time (just index 1 billion data-series)
[Figure: indexing time in minutes, log scale from 1 to 10000; annotated speedups: 6.6x, 40x, 130x, 1000x faster]
Demo
• http://www.mi.parisdescartes.fr/~themisp/rinse/
Publications: VLDB ‘15
Related Work
• significant amount of work in data series indexing
▫ e.g., TS-Tree [Assent et al. ‘08], iSAX [Shieh & Keogh ‘08]
• none of these approaches
▫ considered bulk loading
▫ examined more than 1 Million data series
• several studies for index bulk loading
▫ merge-based techniques assume the data is pre-clustered [Choubey et al. ‘99]
▫ buffering-based techniques work only for balanced indices [Arge et al. ‘02] [Van den Bercken & Seeger ‘01] [Soisalon-Soininen & Widmayer ‘03]
• adaptive indexing/file reorganization for column stores
▫ database cracking [Idreos et al. ‘07], raw file cracking [Idreos et al. ‘11]
Distribution/Parallelization/Cloud?
• discussion so far assumed a single core
▫ focus on efficient resource utilization
▫ squeeze the most out of a single core
▫ produce scalable solutions at lowest possible cost
also suitable for analysts with no access to/expertise for clusters
• further scale-up and scale-out possible!
▫ techniques inherently parallelizable
across cores, across machines
▫ more involved solutions required when optimizing for energy
minimize total work
[Figure: parallelized data series index over the data series collection; compute nodes/threads 0, 1, 2, …, n; First Buffer Layers; Leaf Buffer Layers; subset of collection that contains the answer]
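One simple way to exploit this parallelism is to partition the collection, let each worker find its local best match, and merge the results. The sketch below assumes this generic partition-and-merge scheme; the slides do not specify the actual design, and `best_in_partition`, `parallel_nearest`, and `workers` are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_in_partition(partition, query):
    """Return (distance, series id) of the closest series in one partition."""
    return min((euclidean(series, query), sid) for sid, series in partition)


def parallel_nearest(collection, query, workers=4):
    """Nearest neighbor of `query` in a non-empty collection, searched in partitions."""
    items = list(enumerate(collection))
    # Round-robin partitioning: worker i gets items i, i + workers, ...
    partitions = [items[i::workers] for i in range(workers)]
    partitions = [p for p in partitions if p]  # drop empty partitions
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_best = pool.map(best_in_partition, partitions, [query] * len(partitions))
        # Merge step: the global answer is the best of the local answers.
        return min(local_best)
```

Threads keep the example short; in CPython, CPU-bound work would typically use `ProcessPoolExecutor` or separate machines instead.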
Conclusions
• proposed iSAX 2.0, iSAX 2.0 Clustered, iSAX2+, ADS+
▫ indexing for very large data series collections
code and datasets: http://www.mi.parisdescartes.fr/~themisp/isax2plus/
▫ current state of the art
• experimentally validated proposed approach
▫ first published experiments with 1 Billion data series
• case studies in diverse domains exhibit usefulness of approach
▫ for the first time enable pain-free analysis of existing, vast collections of data series
What Next?
new challenge: index and mine 10 billion data series
• infrastructure monitoring
▫ Facebook wants to manage 4B data series, 12M new values/sec
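A back-of-the-envelope calculation puts these figures in perspective. It assumes values arrive uniformly across series and that each value occupies 8 bytes; both are illustrative assumptions, not numbers from Facebook, and `ingest_profile` is a hypothetical helper.

```python
SERIES = 4_000_000_000       # 4B data series (from the slide)
VALUES_PER_SEC = 12_000_000  # 12M new values/sec (from the slide)
BYTES_PER_VALUE = 8          # assumption: one 64-bit value, no timestamp or overhead
SECONDS_PER_DAY = 86_400


def ingest_profile(series, values_per_sec, bytes_per_value):
    """Average gap between values of one series (s) and raw ingest volume (bytes/day)."""
    seconds_between_values = series / values_per_sec
    bytes_per_day = values_per_sec * bytes_per_value * SECONDS_PER_DAY
    return seconds_between_values, bytes_per_day


gap, per_day = ingest_profile(SERIES, VALUES_PER_SEC, BYTES_PER_VALUE)
print(f"{gap:.0f} s between values per series, {per_day / 1e12:.1f} TB/day")
```

Under these assumptions, each series receives a new value roughly every 333 seconds, and raw ingest amounts to about 8.3 TB per day.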
The Road Ahead
“enable practitioners and non-expert users to easily and efficiently manage and analyze massive data series collections”
• Big Sequence Management System
▫ general purpose data series management system
Current and Past Collaborations
• Infrastructure Monitoring
▫ analysis and mining of hardware and software infrastructure for health monitoring
▫ with Facebook, EDF, Safran
• Human Behavior Patterns
▫ identification of different social groups, and analysis of their macro- and micro-patterns of behavior
▫ with IBM Research, Telecom Italia
• Human Brain Activity
▫ analysis of fully-detailed neurobiological data for explaining brain functions
▫ with ICM
• Green Manufacturing
▫ analysis and optimization of manufacturing processes for energy savings
▫ with SAP, Intel, Volvo, Infineon
• eCrime
▫ identification of fraudulent activities related to the telecommunication industry
▫ with Telecom Italia, Vodafone, Wind
• World Sentiments and Opinions
▫ analysis of aggregate sentiment for different social groups, role of media in public sentiment
▫ with Qatar Computing Research Institute, and Hewlett-Packard Labs
Data-Intensive and Knowledge-Oriented systems