Riding the Big IoT Data Wave Complex Analytics for IoT Data Series

Themis Palpanas Paris Descartes University

Telecom Paristech

Paris, April 2017

References


Themis Palpanas - Telecom Paristech, Apr 2017

• papers ▫ ADS: The Adaptive Data Series Index. VLDBJ 2016

http://www.mi.parisdescartes.fr/~themisp/publications/vldbj16-ads.pdf

▫ Big Sequence Management: A Glimpse on the Past, the Present, and the Future. LNCS, 2016

http://www.mi.parisdescartes.fr/~themisp/publications/sofsem16-bisem.pdf

▫ Query Workloads for Data-Series Indexes. KDD 2015

http://www.mi.parisdescartes.fr/~themisp/publications/kdd15-bends.pdf

▫ RINSE: Interactive Data Series Exploration. VLDB 2015

http://www.mi.parisdescartes.fr/~themisp/publications/vldb15-rinse.pdf

▫ Indexing for Interactive Exploration of Big Data Series. SIGMOD 2014

http://www.mi.parisdescartes.fr/~themisp/publications/sigmod14-ads.pdf

▫ Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS 2014

http://www.mi.parisdescartes.fr/~themisp/publications/kais14-isax2plus.pdf

▫ iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010

http://www.mi.parisdescartes.fr/~themisp/publications/icdm10-billiontimeseries.pdf

• code and datasets ▫ http://www.mi.parisdescartes.fr/~themisp/isax2plus/

• data series toolbox ▫ https://github.com/zoumpatianos/DSStat

• demo ▫ http://daslab.seas.harvard.edu/rinse/


Acknowledgements

• Michele Linardi
• Anna Gogolou
• Botao Peng
• Karima Echihabi
  Paris Descartes University

• Alessandro Camerra
  University of Trento

• Stratos Idreos
• Kostas Zoumpatianos
  Harvard University

• Yin Lou
• Johannes Gehrke
  Cornell University

• Jin Shieh
• Eamonn Keogh
  University of California at Riverside


Executive Summary

• data collected at unprecedented rates

• they enable data-driven scientific discovery

• lots of these data are sequences

▫ takes days-weeks to analyze big sequence collections

our work: analyze big sequences in minutes/seconds

Data series

• Sequence of points ordered along some dimension, e.g., Time or Position

[Figure: values v1, v2, ... plotted along the sequence dimension at x1, x2, ..., xn]

GTCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTGGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATTCAAGGGAAAGCCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTTCTTGGCAGAAGATATTACAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACAGCTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCATCCGCCGCCAGAGGATCTGTAAAAGAGAGGTTGTTACGAAACTGGCAACTGCCAACCAAAGTCCACCAATGGACAAGCAAAAAAGAGCACTCATCTCATGCTCCCAAGGATCAACCTTCCCAGAGTTTTCACTTAAGTGGCCACCAAGCCAGTTGTCAATCCAGGGCTTTGGACTGAAATCTAGGGCTTCATCCGCTACCTCAGAGTGTCTTCTATTTCTTCCAGCCAGTGACAAATACAACAAACATCTGAGATGTTTTAGCTATAAATCCTTTACAATTGTTATTTATGTCTTAACTTTTGTTATACCTGGAAAAGTAGGGGAAACAATAAGAACATACTGTCTTGGCCAAGCATCCAAGGTTAAATGAGTTATGGAAATTCATTTGGGAGCCAAGACATTGCACGTGGTTATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAGCAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGGACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTATTTAAAGAGATTTGGGCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTAGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATCAAGGGAAAGTCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTCTTGGCAGAAGATATTGCAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACACTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCACCCGCTGCCAGATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAACAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGAACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTTTTAAACAGATTTGA


Position


Schinnerer et al.


Telecommunications


• analysis of call activity patterns

▫ Telecom Italia

[Figure: clustermap of incoming calls time series; average number of calls for the 5 smallest clusters, plotted over time; call activity for Easter Monday]

Home Networks


• temporal usage behavior analysis of home networks

▫ Portugal Telecom

[Figure: clustering based on user activity patterns, showing a (previously unknown) frequent behavior pattern over time]

Operation Health Monitoring

[Figure: measurements plotted over time]

Road Tunnel Monitoring and Control

• Lamp levels are typically determined statically, ignoring environmental variations
  ▫ Overprovisioned to meet the regulations
  ▫ Problems: wasted energy and a potential security hazard

• Idea: place wireless sensors along the tunnel, adjust lamps to actual conditions
  ▫ Eliminate overprovisioning, account for environmental variations

[Figure: stop distance]


Tunnel length of 630 m,

2 separate 2-lane carriageways,

~28,000 vehicles/day, 90 WSN nodes

Full, operational system


Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. Practical Data Prediction for Real-World Wireless Sensor Networks. TKDE 27(8), 2015

Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks?. PerCom 2012. BEST PAPER


Data-Driven Data Acquisition (DDDA)

Typical WSN System

Sink gathers all sensor readings of the WSN. Advantage: precise

DDDA/WSN System

Sink predicts sensor readings of the WSN. Advantage: less traffic

What Does Data-Driven Data Acquisition Really Achieve?

DDDA well studied in database community

Depending on scenario, 99% less traffic generated

Does 99% less traffic mean 99% more lifetime?

Current studies look only at application-layer

Full network stack influences lifetime

[Figure: network stack (Hardware, MAC, Routing, Application), shown with and without DDDA at the application layer]

Derivative-Based Prediction (DBP)

[Figure: sensor value over time, showing the learning phase with its edge points, the slope δ, and the prediction phase]

DBP: a linear model

Easy to calculate the model

Easy to decide if the sensed data fit the model
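
A minimal sketch of such a linear model may help here. It assumes that the slope δ is taken between the averages of the first and last few "edge points" of the learning window; the names `fit_dbp`, `predict`, and `num_edge` are illustrative and do not come from the slides.

```python
from statistics import mean

def fit_dbp(window, num_edge):
    """Fit a line through the averaged edge points of a learning window.

    Returns (anchor_time, anchor_value, slope). Times are sample indices
    counted from the start of the window (assumption: uniform sampling).
    """
    if num_edge < 1 or 2 * num_edge > len(window):
        raise ValueError("need 1 <= num_edge <= len(window) / 2")
    first = mean(window[:num_edge])       # average of the leading edge points
    last = mean(window[-num_edge:])       # average of the trailing edge points
    # The centres of the two edge groups are len(window) - num_edge samples apart.
    slope = (last - first) / (len(window) - num_edge)
    anchor_time = (num_edge - 1) / 2      # centre of the leading edge group
    return anchor_time, first, slope

def predict(model, t):
    """Value predicted by the linear model at sample index t."""
    anchor_time, anchor_value, slope = model
    return anchor_value + slope * (t - anchor_time)

model = fit_dbp([0, 1, 2, 3, 4, 5, 6, 7], num_edge=2)
print(predict(model, 10))
```

Averaging a few points at each edge, rather than using single samples, is one simple way to keep sensor noise from dominating the slope.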

Publications: TKDE’15, PERCOM’12, SPRINGER’12

Tolerance for Prediction Error

• Centralized control algorithm requires only approximate values

• Tolerances: sensed values may temporarily exceed these values without requiring new model generation
  ▫ Value tolerance: maximum of a percentage and an absolute value
    Addresses inherent sensor error and variations of low light levels
  ▫ Time tolerance: in terms of sampling periods
    Lamps are adjusted gradually

[Figure: sensor value over time, illustrating the value tolerance and the time tolerance]
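
The two tolerances can be sketched as a simple check. This is only an illustration: it assumes the percentage is taken relative to the predicted value and that a new model is needed once more than `time_tol` consecutive samples fall outside the value tolerance; all function and parameter names are hypothetical.

```python
def within_value_tolerance(sensed, predicted, rel_tol, abs_tol):
    """Allowed error is the larger of a percentage and an absolute bound."""
    allowed = max(rel_tol * abs(predicted), abs_tol)
    return abs(sensed - predicted) <= allowed

def needs_new_model(sensed, predicted, rel_tol, abs_tol, time_tol):
    """True once more than time_tol consecutive samples violate the value tolerance."""
    consecutive = 0
    for s, p in zip(sensed, predicted):
        if within_value_tolerance(s, p, rel_tol, abs_tol):
            consecutive = 0               # back inside the tolerance band
        else:
            consecutive += 1
            if consecutive > time_tol:
                return True
    return False

readings = [10, 20, 20, 10]
model_values = [10, 10, 10, 10]
print(needs_new_model(readings, model_values, 0.05, 1.0, time_tol=2))
print(needs_new_model(readings, model_values, 0.05, 1.0, time_tol=1))
```

With an absolute bound in place, very low light levels do not trigger a new model just because a small deviation is large in percentage terms.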


Testing the DBP Model

Test scenario

• 250 m long tunnel, 40 TMote-equivalent nodes, 1 gateway
• 47 days, samples every 30 seconds, 5.4 million measurements
• DBP: parameters established by lighting engineers

[Figure: deployment with 40 WSN nodes, sink/gateway, and lamps]

DBP Results: Excellent reduction in data traffic

99.1% messages suppressed at the nodes

[Figure: model corrections in 5-minute intervals]

• Fewer than 10 model transmissions every 5 minutes (instead of 400 samples)
• Most models generated during daylight hours
• Few corrections required at night

Comparison to other models and with alternate parameters in the paper
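
The figures quoted for the test scenario can be cross-checked with a few lines of arithmetic. The sketch below assumes that all 40 nodes sample every 30 seconds for the full 47 days; the variable names are illustrative.

```python
NODES = 40
SAMPLE_PERIOD_S = 30
DAYS = 47

samples_per_node_per_day = 24 * 3600 // SAMPLE_PERIOD_S
total_measurements = NODES * DAYS * samples_per_node_per_day
samples_per_5_min = NODES * (5 * 60 // SAMPLE_PERIOD_S)

# Fewer than 10 transmissions per 5-minute interval out of this many samples
# means that at least this fraction of messages is suppressed in any interval.
min_suppressed_fraction = 1 - 10 / samples_per_5_min

print(total_measurements)        # about 5.4 million
print(samples_per_5_min)         # the "400 samples" on the slide
print(min_suppressed_fraction)
```

The per-interval bound is looser than the 99.1% overall suppression reported on the slide, which is what one would expect if many intervals need far fewer than 10 model transmissions.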

DBP Results: Excellent data approximation

• Infrequent models yield data very close to the actual readings
• Deviations from tolerance correspond to model changes
• Value tolerance magnitude calculated based on actual light values

Massive Data Series Collections

• Human Genome project: 130 TB
• NASA’s Solar Observatory: 1.5 TB per day
• Planned Large Synoptic Survey Telescope: ~30 TB per night
• data center and services monitoring: 2B data series, 4M points/sec

What do we want to do with them?

- simple query answering
  ▫ select some data series
  ▫ select values in time interval
  ▫ select values in some range
  ▫ combinations of those

- complex analytics
  ▫ Similarity Search
  ▫ Classification
  ▫ Clustering
  ▫ Outlier Detection
  ▫ Frequent Pattern Mining

[Figure: similarity search: a query is matched against a sequence collection, returning the similar sequences]

Euclidean

$$D(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Dynamic Time Warping (DTW)

$$D_{dtw}(X,Y) = f(n,m)$$

$$f(i,j) = |x_i - y_j| + \min \begin{cases} f(i,j-1) \\ f(i-1,j) \\ f(i-1,j-1) \end{cases}$$
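
The two distance measures named on this slide, Euclidean distance and Dynamic Time Warping, can be sketched in a few lines. This is a plain, unoptimized illustration; it assumes the absolute difference as the local DTW cost and no warping-window constraint, and the function names are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two series of equal length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Dynamic Time Warping distance via the standard O(n*m) recurrence."""
    n, m = len(x), len(y)
    # f[i][j] holds the cost of the best alignment of x[:i] with y[:j].
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            f[i][j] = cost + min(f[i - 1][j - 1], f[i - 1][j], f[i][j - 1])
    return f[n][m]

print(euclidean([0, 0], [3, 4]))
print(dtw([1, 2, 3], [1, 2, 2, 3]))
```

Unlike the Euclidean distance, DTW accepts series of different lengths and tolerates local stretching along the sequence dimension, at the price of quadratic time per comparison.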


HARD, because of very high dimensionality:

each data series has 100s-1000s of points!


even HARDER, because of very large size: millions to billions of data series (multi-TBs)!

Query answering process

[Figure: the Data Loading Procedure turns raw data into a data series database/index (data-to-query time); the Query Answering Procedure takes queries and returns answers (query answering time)]

these times are big!

Similarity Search via Serial Scan

Similarity Search via Indexing
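
As a baseline, similarity search via serial scan simply compares the query against every series in the collection. The sketch below assumes Euclidean distance and an in-memory list of equal-length series; `serial_scan_1nn` is an illustrative name, not something from the slides.

```python
import math

def serial_scan_1nn(query, collection):
    """Return (index, distance) of the nearest neighbor of query."""
    best_idx, best_sq = -1, math.inf
    for idx, series in enumerate(collection):
        # Squared distance avoids a square root per candidate.
        sq = sum((q - s) ** 2 for q, s in zip(query, series))
        if sq < best_sq:
            best_idx, best_sq = idx, sq
    return best_idx, math.sqrt(best_sq)

collection = [[0, 0], [1, 1], [5, 5]]
print(serial_scan_1nn([1, 2], collection))
```

Every query touches the whole collection, so the cost grows linearly with the data size; this is the cost that an index tries to avoid.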

Traditional Approaches

answer nearest neighbor queries on a 1TB dataset:

• serial scan takes 45 minutes/query
• a data series index can reduce querying time
• but building the index takes too long!
  ▫ indexing a 1TB dataset takes days
• complex analytics in hours/days…

Query answering process

we have proposed the state-of-the-art solutions for both problems (data-to-query time and query answering time)!

Our Approach: ADS+

complex analytics in minutes/seconds!

Outline

• background

▫ SAX representation

▫ iSAX representation

▫ iSAX index

• proposed solution

▫ bulk loading

▫ splitting policy

▫ adaptive solution

• experimental evaluation, case studies

• conclusions, future work, new challenges

SAX Representation

• Symbolic Aggregate approXimation (SAX)
  ▫ (1) Represent data series T of length n with w segments using Piecewise Aggregate Approximation (PAA)
    T typically normalized to μ = 0, σ = 1

    PAA(T,w) = $\bar{T} = \bar{t}_1, \ldots, \bar{t}_w$, where

    $$\bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j$$

  ▫ (2) Discretize into a vector of symbols
    Breakpoints map to small alphabet a of symbols

[Figure: a data series T of length 16 with values between -3 and 3, its approximation PAA(T,4), and iSAX(T,4,4) with symbols 00, 01, 10, 11]
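
The two SAX steps can be sketched as follows. The slides only say that breakpoints map values to a small alphabet; the sketch assumes the usual SAX choice of breakpoints that split the standard normal distribution into equiprobable regions, numbers symbols from 0 for the lowest region, and requires n to be a multiple of w. The function names are illustrative.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def znormalize(series):
    """Normalize a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(v - mu) / sigma for v in series]

def paa(series, w):
    """Step (1): mean of each of w equal-length segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is a multiple of w")
    seg = n // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]

def sax(series, w, cardinality):
    """Step (2): map each PAA value to the index of its breakpoint region."""
    breakpoints = [NormalDist().inv_cdf(k / cardinality)
                   for k in range(1, cardinality)]
    return [bisect_right(breakpoints, v) for v in paa(znormalize(series), w)]

print(paa([1, 1, 3, 3, 5, 5, 7, 7], 4))
print(sax(list(range(16)), 4, 4))
```

A series of length 16 is thus reduced to a word of 4 small integers, which is what makes the representation attractive for indexing.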


iSAX Representation

• iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity

iSAX(T,4,8) = { 6, 6, 3, 0 } = { 110, 110, 011, 000 }

iSAX(T,4,4) = { 3, 3, 1, 0 } = { 11, 11, 01, 00 }

iSAX(T,4,2) = { 1, 1, 0, 0 } = { 1, 1, 0, 0 }
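
The multi-resolution property can be illustrated with bit shifts: a coarser iSAX symbol is obtained by dropping the least significant bits of a finer one. The sketch starts from the word { 6, 6, 3, 0 } shown on the slide, read as 3-bit symbols; the helper names are illustrative.

```python
def reduce_cardinality(symbol, from_bits, to_bits):
    """Drop the least significant bits to move to a coarser resolution."""
    if to_bits > from_bits:
        raise ValueError("cannot increase resolution without the raw data")
    return symbol >> (from_bits - to_bits)

def as_bits(symbol, bits):
    """Binary string of a symbol, padded to the given number of bits."""
    return format(symbol, f"0{bits}b")

word_3bit = [6, 6, 3, 0]
word_2bit = [reduce_cardinality(s, 3, 2) for s in word_3bit]
word_1bit = [reduce_cardinality(s, 3, 1) for s in word_3bit]

print([as_bits(s, 3) for s in word_3bit])
print(word_2bit, [as_bits(s, 2) for s in word_2bit])
print(word_1bit, [as_bits(s, 1) for s in word_1bit])
```

Because each coarser symbol is a bit prefix of the finer one, words stored at different resolutions remain comparable.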

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

iSAX Index

Themis Palpanas - Telecom Paristech, Apr 2017

89

Page 36: Riding the Big IoT Data Wave · Riding the Big IoT Data Wave Complex Analytics for IoT Data Series Themis Palpanas Paris Descartes University Telecom Paristech ... NASA’s Solar

27-Apr-17

36

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

iSAX Index

Themis Palpanas - Telecom Paristech, Apr 2017

90

e.g., th=4, w=4, b=1

1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

iSAX Index

Themis Palpanas - Telecom Paristech, Apr 2017

91

1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0

e.g., th=4, w=4, b=1

Insert: 1 1 1 0

Page 37: Riding the Big IoT Data Wave · Riding the Big IoT Data Wave Complex Analytics for IoT Data Series Themis Palpanas Paris Descartes University Telecom Paristech ... NASA’s Solar

27-Apr-17

37

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

iSAX Index

Themis Palpanas - Telecom Paristech, Apr 2017

92

1 1 10 0 1 1 10 0

1 1 11 0 1 1 11 0

e.g., th=4, w=4, b=1

1 1 11 0

1 1 1 0


• Approximate Search

▫ Match iSAX representation at each level

• Exact Search

▫ Leverage approximate search

▫ Prune search space

Lower bounding distance
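The interplay of the two search modes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: leaves are kept in a flat list instead of a tree, breakpoints are taken from a standard normal distribution (as is usual for z-normalized series), the series length is assumed to be a multiple of w, and all function names are hypothetical. The first leaf visited plays the role of the approximate search; its best-so-far distance then prunes the remaining leaves through the lower bound.

```python
import math
from statistics import NormalDist


def paa(series, w):
    # Piecewise Aggregate Approximation: mean of each of w equal-length segments.
    step = len(series) // w
    return [sum(series[i * step:(i + 1) * step]) / step for i in range(w)]


def region(symbol, bits):
    # Value range [lo, hi] covered by a `bits`-bit symbol under N(0, 1) breakpoints.
    card = 1 << bits
    q = NormalDist().inv_cdf
    lo = -math.inf if symbol == 0 else q(symbol / card)
    hi = math.inf if symbol == card - 1 else q((symbol + 1) / card)
    return lo, hi


def lower_bound(query_paa, word, n):
    # Never exceeds the true Euclidean distance between the query and any
    # series of length n whose PAA falls inside the node's region.
    total = 0.0
    for q, (symbol, bits) in zip(query_paa, word):
        lo, hi = region(symbol, bits)
        gap = lo - q if q < lo else q - hi if q > hi else 0.0
        total += gap * gap
    return math.sqrt(n / len(word) * total)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query, leaves, w):
    # leaves: list of (word, [raw series]) pairs, a flat stand-in for the tree.
    n, q_paa = len(query), paa(query, w)
    ranked = sorted(leaves, key=lambda leaf: lower_bound(q_paa, leaf[0], n))
    best, best_dist, visited = None, math.inf, 0
    for word, members in ranked:
        if lower_bound(q_paa, word, n) >= best_dist:
            break  # every remaining leaf has an even larger bound
        visited += 1
        for s in members:
            d = euclidean(query, s)
            if d < best_dist:
                best, best_dist = s, d
    return best, best_dist, visited
```

Because the bound never overestimates the true distance, skipping a leaf whose bound is at least the best-so-far distance cannot discard the true nearest neighbor.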

[Figure: example iSAX index tree below ROOT, with nodes labeled by iSAX words such as 0 0 0 0, 1 1 1 0 and 1 1 1 1. Approximate search matches the iSAX representation at each level; exact search uses a lower bounding function.]

iSAX Index: Shortcomings & Challenges

• this is a wonderful index!


• … but why does it take soooo long to build for huge datasets?


• because iSAX implements

▫ a naive node splitting policy

may lead to ineffective splits and additional disk I/O


▫ no bulk loading strategy

does not use available main memory to reduce disk I/O


iSAX 2.0

Bulk Loading Algorithm

• design principles:

▫ take advantage of available main memory

▫ maximize sequential disk accesses


• intuition for proposed solution:

▫ for each leaf node, collect as many data series that belong to it as possible before materializing the leaf node

the raw values of data series in leaf nodes are written to disk


• iterate between two phases (till all data series are indexed):

▫ Phase 1

read data series and group them according to first-level nodes

use all available main memory

▫ Phase 2

grow index by processing the subtree rooted at each one of the first-level nodes one at-a-time

flush leaf node contents to disk using sequential accesses
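The two phases above can be sketched as a small simulation. It is only an illustration under stated assumptions: `memory_budget` counts series rather than bytes, the "disk" is a dictionary, and instead of growing each subtree with real splits the hypothetical `leaf_key` helper assigns leaves by a fixed 2-bit prefix per segment.

```python
from collections import defaultdict
from itertools import islice


def first_level_key(sax, max_bits=8):
    # First-level node: the top bit of every segment.
    return tuple(s >> (max_bits - 1) for s in sax)


def leaf_key(sax, bits=2, max_bits=8):
    # Stand-in for growing the subtree: a fixed 2-bit prefix per segment.
    return tuple(s >> (max_bits - bits) for s in sax)


def bulk_load(stream, memory_budget):
    disk = defaultdict(list)   # leaf -> series already flushed to it
    flushes = 0
    stream = iter(stream)
    while True:
        # Phase 1: fill main memory, grouping series by first-level node.
        batch = list(islice(stream, memory_budget))
        if not batch:
            break
        first_buffers = defaultdict(list)
        for sax in batch:
            first_buffers[first_level_key(sax)].append(sax)
        # Phase 2: process the subtree of one first-level node at a time.
        for buffered in first_buffers.values():
            leaf_buffers = defaultdict(list)
            for sax in buffered:
                leaf_buffers[leaf_key(sax)].append(sax)
            for leaf, members in leaf_buffers.items():
                disk[leaf].extend(members)   # one write per leaf, not per series
                flushes += 1
    return disk, flushes


series = [(200, 10), (210, 20), (10, 200), (220, 30), (20, 210), (130, 100)]
disk, flushes = bulk_load(series, memory_budget=4)
print(flushes, "flushes for", len(series), "series")
```

Batching the series that belong to the same leaf before writing is what turns many small random writes into fewer, larger sequential ones.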


Publications

ICDM‘10

[Figure: step-by-step animation of iSAX 2.0 bulk loading. The index (root R, internal node I1, leaves L1 to L4) is drawn together with two buffer layers, the First Buffer Layer (FBL) and the Leaf Buffer Layer (LBL), across main memory and disk; one step is labeled "insert new ds". Slide annotations:]

• no limit in the size of FBLs!

• LBLs have same size as leaf nodes!

• no extra memory needed!

• mainly sequential writes!

Experimental Evaluation

Bulk Loading

[Figure: time to build (hours, axis from 0 to 500) versus number of data series indexed (100M to 1B), for iSAX-BufferTree, iSAX, and iSAX 2.0.]

• 1 Billion data series indexed in 16 days: 72% less time than iSAX

• indexing time per data series: 0.001 sec


iSAX2+

• intuition for proposed solution:

▫ iSAX grows fast at the beginning of bulk loading, its shape stabilizing well before the end of the process

▫ several data series end up in leaf nodes that never need to split

▫ implement lazy splitting:

move raw data to leaf node the first time

if leaf node splits, do not move raw data until the end of index building process
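Lazy splitting as described above can be illustrated with a toy model that counts raw-data writes. Everything in it is an assumption made for illustration rather than the iSAX2+ code: summaries are single 8-bit values, leaves are bit prefixes in a binary trie, the function is called `build`, and the eager baseline is assumed to rewrite a leaf's raw data at every split.

```python
def build(values, th, lazy):
    """Index 8-bit values in a trie of bit-prefix leaves and count raw-data writes."""
    code = [format(v, "08b") for v in values]
    leaves = {"": []}      # leaf prefix -> ids of the series it summarizes
    first_leaf = {}        # leaf where each series' raw data was first written
    raw_writes = 0
    for sid, bits in enumerate(code):
        prefix = next(p for p in leaves if bits.startswith(p))
        leaves[prefix].append(sid)
        first_leaf[sid] = prefix
        raw_writes += 1                      # raw data goes to the leaf the first time
        while len(leaves[prefix]) > th and len(prefix) < 8:
            members = leaves.pop(prefix)
            children = (prefix + "0", prefix + "1")
            for child in children:
                leaves[child] = [m for m in members if code[m].startswith(child)]
            if not lazy:
                raw_writes += len(members)   # eager: rewrite raw data at every split
            prefix = max(children, key=lambda c: len(leaves[c]))
    if lazy:
        # Deferred pass at the end of index building: each series whose leaf
        # changed since its first write is moved exactly once.
        final = {sid: p for p, ids in leaves.items() for sid in ids}
        raw_writes += sum(final[sid] != first_leaf[sid] for sid in final)
    return leaves, raw_writes


values = [0, 128, 64, 192, 32]
print(build(values, th=2, lazy=False)[1])  # 11 raw writes
print(build(values, th=2, lazy=True)[1])   # 9 raw writes
```

Both variants produce the same leaves; the lazy one writes each series at most twice, while the eager one pays again for every split a series takes part in, so the gap widens as leaves split repeatedly.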


Publications

KAIS‘14

Experimental Evaluation

Bulk Loading


• 1 Billion data series indexed in 10 days: 82% less time than iSAX

• indexing time per data series: 0.8 milliseconds
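The per-series figures quoted on the two bulk-loading slides follow from simple division. The short check below reproduces them and also derives the baseline build time implied by each percentage, assuming both percentages are measured against the same iSAX baseline; the helper names are hypothetical and the slide values are rounded.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_series_ms(days, n_series):
    # Average indexing time per data series, in milliseconds.
    return days * SECONDS_PER_DAY * 1000 / n_series


def implied_baseline_days(days, fraction_saved):
    # Build time of the baseline if `days` is `fraction_saved` less than it.
    return days / (1 - fraction_saved)


one_billion = 1_000_000_000
print(round(per_series_ms(16, one_billion), 2))    # 16 days: about 1.38 ms
print(round(per_series_ms(10, one_billion), 2))    # 10 days: about 0.86 ms
print(round(implied_baseline_days(16, 0.72), 1))   # about 57 days
print(round(implied_baseline_days(10, 0.82), 1))   # about 56 days
```

The two implied baselines agree to within rounding, which is consistent with both reductions being reported against the same index.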

[Figure: time to build (hours, axis from 0 to 450) versus number of time series indexed (100M to 1B), for iSAX 2.0, iSAX2+, and iSAX 2.0 Clustered.]

Drawback of iSAX2+

• cannot start answering queries until entire index is built!


Adaptive Data Series Index: ADS+

• novel paradigm for building a data series index

▫ do not build entire index and then answer queries

▫ start answering queries by building the part of the index needed by those queries

• still guarantee correct answers


• intuition for proposed solution

▫ build the iSAX index using the iSAX representations

▫ just like iSAX2+

▫ but start with a large leaf size

▫ minimize initial cost

▫ postpone leaf materialization to query time

▫ only materialize (at query time) leaves needed by queries

▫ parts that are queried more are refined more

▫ use smaller leaf sizes (reduced leaf materialization and query answering costs)
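The intuition above can be sketched with a toy adaptive index. It is a rough illustration only: the `AdaptiveIndex` class, the one-dimensional 8-bit `summary` (a quantized mean of values assumed to lie in [0, 1)), the bit-prefix leaves, and the approximate-only query are all assumptions, not the ADS+ implementation. The index is built from summaries alone with a large leaf size; a query then splits only the leaf it lands in down to a smaller size and materializes that leaf from the raw file.

```python
import math


def summary(series):
    # Hypothetical 8-bit summary: quantized mean of values assumed to lie in [0, 1).
    return format(min(255, int(sum(series) / len(series) * 256)), "08b")


class AdaptiveIndex:
    def __init__(self, raw_file, build_leaf_size, query_leaf_size):
        self.raw_file = raw_file              # list standing in for the raw data file
        self.query_leaf_size = query_leaf_size
        self.codes = [summary(s) for s in raw_file]
        self.leaves = {"": []}                # leaf prefix -> positions in raw_file
        self.materialized = set()             # leaves whose raw data has been loaded
        for pos in range(len(raw_file)):
            self._add(pos, build_leaf_size)

    def _leaf_of(self, code):
        return next(p for p in self.leaves if code.startswith(p))

    def _split(self, prefix):
        members = self.leaves.pop(prefix)
        for child in (prefix + "0", prefix + "1"):
            self.leaves[child] = [m for m in members if self.codes[m].startswith(child)]

    def _add(self, pos, leaf_size):
        # Build time: only summaries and positions are stored, with large leaves.
        prefix = self._leaf_of(self.codes[pos])
        self.leaves[prefix].append(pos)
        while len(self.leaves[prefix]) > leaf_size and len(prefix) < 8:
            self._split(prefix)
            prefix = self._leaf_of(self.codes[pos])

    def query(self, series):
        code = summary(series)
        prefix = self._leaf_of(code)
        # Adaptive split: refine only the leaf this query lands in.
        while len(self.leaves[prefix]) > self.query_leaf_size and len(prefix) < 8:
            self._split(prefix)
            prefix = self._leaf_of(code)
        self.materialized.add(prefix)         # raw data is fetched only now
        candidates = [self.raw_file[p] for p in self.leaves[prefix]]
        return min(candidates, key=lambda s: math.dist(s, series), default=None)


raw = [(x, x) for x in (0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)]
index = AdaptiveIndex(raw, build_leaf_size=4, query_leaf_size=2)
print(index.query((0.12, 0.12)))   # (0.1, 0.1)
print(sorted(index.leaves))        # ['00', '01', '1']
```

After the query, the queried half of the space has been refined into two smaller leaves while the unqueried leaf keeps its large build-time size and is never materialized.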

Publications

SIGMOD‘14

VLDBJ‘16

[Figure: step-by-step animation of ADS+ index construction and query answering. The index tree (ROOT, internal nodes I1 to I3, leaves L1 to L5) and the FBL and LBL buffers are shown in RAM; the raw data file and the leaves, marked PARTIAL or FULL, are shown on DISK. Slide captions, in order:]

• Start building an index with only the iSAX representations

• Read the data series one by one from the raw file

• Convert them to iSAX

• Store only iSAX in memory (64 times smaller) ~1%

• Discard raw data and keep pointer to raw file

• Continue loading data until we run out of memory

• Expand each sub-tree and move data to LBL

• We flush the data to the disk to free up memory

• Query #1 arrives at a leaf that is TOO BIG!

• Adaptive split: create a smaller leaf

• Load data in LBL and answer the query

• We spill to the disk when we run out of memory

Experimental Evaluation


• iSAX 2.0 needs more than 35 hours to answer 100K queries

• ADS+ answers 100K queries in less than 5 hours

7x faster

Comparison to multi-dimensional indices

1-3 orders of magnitude faster than multi-dimensional indexing methods

measure data-to-query time (just index 1 billion data-series)

[Figure: indexing time (minutes, log scale from 1 to 10000); bar annotations: 6.6x faster, 40x faster, 130x faster, 1000x faster.]

Demo


• http://www.mi.parisdescartes.fr/~themisp/rinse/

Publications

VLDB‘15

Related Work

• significant amount of work in data series indexing

▫ e.g., TS-Tree [Assent et al. ‘08], iSAX [Shieh & Keogh ‘08]

• none of these approaches

▫ considered bulk loading

▫ examined more than 1 Million data series

• several studies for index bulk loading

▫ merge-based assume data is pre-clustered [Choubey et al. ‘99]

▫ buffering-based work only for balanced indices [Arge et al. ‘02] [Van den Bercken & Seeger ‘01] [Soisalon-Soininen & Widmayer ‘03]

• Adaptive indexing/file reorganization for column stores

▫ Database cracking [Idreos et al. ‘07], raw file cracking [Idreos et al. ’11]


Distribution/Parallelization/Cloud?


• discussion so far assumed a single core

▫ focus on efficient resource utilization

▫ squeeze the most out of a single core

▫ produce scalable solutions at lowest possible cost

also suitable for analysts with no access to/expertise for clusters


• further scale-up and scale-out possible!

▫ techniques inherently parallelizable

across cores, across machines

▫ more involved solutions required when optimizing for energy

minimize total work

[Figure: parallelized data series index over a data series collection. Labels: compute nodes/threads 0, 1, 2, …, n; First Buffer Layers; Leaf Buffer Layers; subset of collection that contains the answer.]

Conclusions

• proposed iSAX 2.0, iSAX 2.0 Clustered, iSAX2+, ADS+

▫ indexing for very large data series collections

code and datasets: http://www.mi.parisdescartes.fr/~themisp/isax2plus/

▫ current state of the art

• experimentally validated proposed approach

▫ first published experiments with 1 Billion data series


• case studies in diverse domains exhibit usefulness of approach

▫ for the first time enable pain-free analysis of existing, vast collections of data series


What Next?

new challenge: index and mine 10 billion data series


• infrastructure monitoring

▫ Facebook wants to manage 4B data series, 12M new values/sec


The Road Ahead

“enable practitioners and non-expert users to easily and efficiently manage and analyze massive data series collections”


• Big Sequence Management System

▫ general purpose data series management system


Current and Past Collaborations

Infrastructure Monitoring

analysis and mining of hardware and software infrastructure for health monitoring

with Facebook, EDF, Safran

Human Behavior Patterns

identification of different social groups, and analysis of their macro- and micro-patterns of behavior

with IBM Research, Telecom Italia

Human Brain Activity

analysis of fully-detailed neurobiological data for explaining brain functions

with ICM

Green Manufacturing

analysis and optimization of manufacturing processes for energy savings

with SAP, Intel, Volvo, Infineon

eCrime

identification of fraudulent activities related to the telecommunication industry

with Telecom Italia, Vodafone, Wind

World Sentiments and Opinions

analysis of aggregate sentiment for different social groups, role of media in public sentiment

with Qatar Computing Research Institute, and Hewlett-Packard Labs

Data-Intensive and Knowledge-Oriented systems