riding the big iot data wave · riding the big iot data wave complex analytics for iot data series...

27-Apr-17

1

Riding the Big IoT Data Wave Complex Analytics for IoT Data Series

Themis Palpanas Paris Descartes University

Telecom Paristech

Paris, April 2017

References

2

Themis Palpanas - Telecom Paristech, Apr 2017

• papers ▫ ADS: The Adaptive Data Series Index. VLDBJ 2016

http://www.mi.parisdescartes.fr/~themisp/publications/vldbj16-ads.pdf

▫ Big Sequence Management: A Glimpse on the Past, the Present, and the Future. LNCS, 2016

http://www.mi.parisdescartes.fr/~themisp/publications/sofsem16-bisem.pdf

▫ Query Workloads for Data-Series Indexes. KDD 2015

http://www.mi.parisdescartes.fr/~themisp/publications/kdd15-bends.pdf

▫ RINSE: Interactive Data Series Exploration. VLDB 2015

http://www.mi.parisdescartes.fr/~themisp/publications/vldb15-rinse.pdf

▫ Indexing for Interactive Exploration of Big Data Series. SIGMOD 2014

http://www.mi.parisdescartes.fr/~themisp/publications/sigmod14-ads.pdf

▫ Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS 2014

http://www.mi.parisdescartes.fr/~themisp/publications/kais14-isax2plus.pdf

▫ iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010

http://www.mi.parisdescartes.fr/~themisp/publications/icdm10-billiontimeseries.pdf

• code and datasets ▫ http://www.mi.parisdescartes.fr/~themisp/isax2plus/

• data series toolbox ▫ https://github.com/zoumpatianos/DSStat

• demo ▫ http://daslab.seas.harvard.edu/rinse/




























http://www.mi.parisdescartes.fr/~themisp/isax2plus/


https://github.com/zoumpatianos/DSStat

https://github.com/zoumpatianos/DSStat

http://daslab.seas.harvard.edu/rinse/


27-Apr-17

2

Acknowledgements

• Michele Linardi

• Anna Gogolou

• Botao Peng

• Karia Echihabi

Paris Descartes University

• Alessandro Camerra University of Trento

• Stratos Idreos

• Kostas Zoumpatianos Harvard University

• Yin Lou

• Johannes Gehrke Cornell University

• Jin Shieh

• Eamonn Keogh University of California at Riverside


3

Executive Summary

• data collected at unprecedented rates

• they enable data-driven scientific discovery

• lots of these data are sequences

▫ takes days-weeks to analyze big sequence collections

4


27-Apr-17

3

Executive Summary

• data collected at unprecedented rates

• they enable data-driven scientific discovery

• lots of these data are sequences

▫ takes days-weeks to analyze big sequence collections

5

our work: analyze big sequences in minutes/seconds


Data series

6


27-Apr-17

4

Data series

7


• Sequence of points ordered along some dimension

v1 v2

…

sequence dimension

x1 x2

xn va

lue

Data series

8



Time

v1 v2

…

sequence dimension

x1 x2

xn va

lue

27-Apr-17

5

Data series

9



Time Position

v1 v2

…

sequence dimension

x1 x2

xn va

lue

GTCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTGGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATTCAAGGGAAAGCCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTTCTTGGCAGAAGATATTACAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACAGCTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCATCCGCCGCCAGAGGATCTGTAAAAGAGAGGTTGTTACGAAACTGGCAACTGCCAACCAAAGTCCACCAATGGACAAGCAAAAAAGAGCACTCATCTCATGCTCCCAAGGATCAACCTTCCCAGAGTTTTCACTTAAGTGGCCACCAAGCCAGTTGTCAATCCAGGGCTTTGGACTGAAATCTAGGGCTTCATCCGCTACCTCAGAGTGTCTTCTATTTCTTCCAGCCAGTGACAAATACAACAAACATCTGAGATGTTTTAGCTATAAATCCTTTACAATTGTTATTTATGTCTTAACTTTTGTTATACCTGGAAAAGTAGGGGAAACAATAAGAACATACTGTCTTGGCCAAGCATCCAAGGTTAAATGAGTTATGGAAATTCATTTGGGAGCCAAGACATTGCACGTGGTTATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAGCAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGGACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTATTTAAAGAGATTTGGGCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTAGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATCAAGGGAAAGTCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTCTTGGCAGAAGATATTGCAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACACTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCACCCGCTGCCAGATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAACAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGAACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTTTTAAACAGATTTGA


21

Position

27-Apr-17

6



22

Position



23

Position

27-Apr-17

7


27


28

Schinnerer et al.

27-Apr-17

8

Telecommunications


29

• analysis of call activity patterns

▫ Telecom Italia

clustermap of incoming calls time series 0

10000

20000

30000

40000

50000

60000

13

0 59

88

117

146

175

20

42

33

26

22

91

32

03

49

37

84

07

43

64

65

49

45

23

55

25

81

610

63

96

68

69

7

average number of calls for 5 smallest clusters

call activity for Easter Monday

Time

Home Networks


30

• temporal usage behavior analysis of home networks

▫ Portugal Telecom

clustering based on user activity patterns (previously unknown) frequent behavior pattern

Time

27-Apr-17

9

32

Operation Health Monitoring


Time

33



Time

27-Apr-17

10

34



Time

Road Tunnel Monitoring and Control Lamp levels typically statically determined, ignoring environmental

Overprovisioned to meet the regulations

Problems: waste energy and potential security hazard

Idea: place wireless sensors along tunnel, adjust lamps to actual conditions

Eliminate overprovisioning, account for environmental variations

stop distance

Themis Palpanas - Telecom Paristech, Apr 2017 35

27-Apr-17

11

Tunnel length of 630 m,

2 separate 2-lane carriageways,

~28,000 vehicles/day, 90 WSN nodes

Full, operational system


37


Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. Practical Data Prediction for Real-World Wireless Sensor Networks. TKDE 27(8), 2015

Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks?. PerCom 2012. BEST PAPER

27-Apr-17

12

time

Data-Driven Data Acquisition (DDDA)

Typical WSN System Sink gathers all sensor readings of the WSN. Advantage: precise

DDDA/WSN System

Sink predicts sensor readings of the WSN. Advantage: less traffic

38 Themis Palpanas - Telecom Paristech, Apr 2017

What Does Data-Driven Data Acquisition Really Achieve?

DDDA well studied in database community

Depending on scenario, 99% less traffic generated

Does 99% less traffic mean 99% more lifetime?

Current studies look only at application-layer

Full network stack influences lifetime

39

Hardware

MAC

Routing

Application DDDA

Hardware

MAC

Routing

Application


27-Apr-17

13

Derivative-Based Prediction (DBP)

Sen

so

r valu

e

Time

learning phase prediction phase

edge

points edge

points

δ

DBP: a linear model

Easy to calculate the model

Easy to decide if the sensed data fit the model


Publications

TKDE’15

PERCOM’12

SPRINGER’12

Centralized control algorithm requires only approximate values

Tolerances: sensed values may temporarily exceed these values

without requiring new model generation

Value tolerance – maximum of a percentage and an absolute

Addresses inherent sensor error and variations of low light levels

Time tolerance – in terms of sampling periods

Lamps are adjusted gradually

Tolerance for Prediction Error

Sen

so

r valu

e

Time

value

tolerance

time

tolerance


27-Apr-17

14

Test scenario

250 m long tunnel, 40 TMote equivalent nodes, 1 gateway

47 days, samples every 30 seconds, 5.4 million measurements

DBP: parameters established by lighting engineers

40 WSN nodes sink/gateway lamps

Testing the DBP Model


DBP Results: Excellent reduction in data traffic

99.1% messages suppressed at the nodes

Model C

orr

ections

in 5

-min

ute

inte

rvals

Less than 10 model transmissions each 5 min. [instead of 400 samples]

Most models generated during

daylight hours

Few corrections required at night

43 Comparison to other models and with alternate parameters in the paper Themis Palpanas - Telecom Paristech, Apr 2017

27-Apr-17

15

DBP Results: Excellent data approximation

Infrequent models yield data very close to the

actual readings

Deviations from tolerance correspond to model

changes

Value tolerance magnitude calculated based on actual

light values


45


27-Apr-17

16

Massive Data Series Collections

Human Genome project

130 TB

NASA’s Solar Observatory

1.5 TB per day

Planned Large Synoptic Survey Telescope

~30 TB per night


data center and services monitoring

2B data series 4M points/sec

What do we want to do with them?

- simple query answering

Simlarity Search

select some data series

select values in time interval

select values in some

range

combinations of those


47

27-Apr-17

17


- complex analytics

Simlarity Search

Classification

Clustering Outlier

Detection

Frequent Pattern Mining


48


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



49

27-Apr-17

18


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



50

sequence

collection


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



51

query

sequence

collection

27-Apr-17

19


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



52

similar sequences

query

sequence

collection


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



53

Euclidean

n

iii yxYXD

1

2,

27-Apr-17

20


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



54

Euclidean

Dynamic Time

Warping (DTW)

n

iii yxYXD

1

2,

)1,1(

),1(

)1,(

min),(

),(),(

jif

jif

jif

yxjif

mnfYXD

ji

dtw


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



55

Euclidean

Dynamic Time

Warping (DTW)

27-Apr-17

21


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



56


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



57

27-Apr-17

22


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



58


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



59

HARD, because of very high dimensionality:

each data series has 100s-1000s of points!

27-Apr-17

23


- complex analytics

Similarity Search

Classification

Clustering Outlier

Detection



60

HARD, because of very high dimensionality:

each data series has 100s-1000s of points!

even HARDER, because of very large size: millions to billions of data series (multi-TBs)!

Query answering process

Query Answering Procedure Data Loading Procedure

Raw data


27-Apr-17

24


data-to-query time


Data Series Database/ Indexing

Data Raw data



data-to-query time query answering time


Answers


Data Raw data

Queries


27-Apr-17

25




Answers


Data Raw data

Queries


these times are big!

Similarity Search via

Serial Scan


27-Apr-17

26


Serial Scan



Serial Scan


27-Apr-17

27


Indexing



Indexing


27-Apr-17

28


Indexing



Indexing


27-Apr-17

29


Traditional Approaches

answer nearest neighbor queries on a 1TB dataset



answer nearest neighbor queries on a 1TB dataset:

Query Answering

serial scan takes 45 minutes/query

27-Apr-17

30




Query Answering

Query Answering

a data series index can

reduce querying time





Query Answering

Query Answering

but building the index takes too long!




27-Apr-17

31




Query Answering

Query Answering


indexing a 1TB dataset takes days







Query Answering

Query Answering


indexing a 1TB dataset takes days

complex analytics in hours/days…




27-Apr-17

32





Answers

we have proposed the

state-of-the-art solutions for both problems!


Data Raw data

Queries

78


Our Approach: ADS+

79

complex analytics in minutes/seconds!

27-Apr-17

33

Outline

• background

▫ SAX representation

▫ iSAX representation

▫ iSAX index

• proposed solution

▫ bulk loading

▫ splitting policy

▫ adaptive solution

• experimental evaluation, case studies

• conclusions, future work, new challenges


84

-3

-2

-1

0

1

2

3

4 8 12 16 0

A dataa series T SAX Representation

• Symbolic Aggregate approXimation (SAX) ▫ (1) Represent data series T of length n

with w segments using Piecewise Aggregate Approximation (PAA)


85

27-Apr-17

34

-3

-2

-1

0

1

2

3

4 8 12 16 0

A data series T

4 8 12 16 0

PAA(T,4)

-3

-2

-1

0

1

2

3

SAX Representation


with w segments using Piecewise Aggregate Approximation (PAA) T typically normalized to μ = 0, σ = 1

PAA(T,w) =

where

wttT ,,1

i

ij

jnw

i

wn

wn

Tt1)1(


86

-3

-2

-1

0

1

2

3

4 8 12 16 0

00

01

10

11

iSAX(T,4,4)

-3

-2

-1

0

1

2

3

4 8 12 16 0

A data series T

4 8 12 16 0

PAA(T,4)

-3

-2

-1

0

1

2

3

SAX Representation


with w segments using Piecewise Aggregate Approximation (PAA) T typically normalized to μ = 0, σ = 1

PAA(T,w) =

where

▫ (2) Discretize into a vector of symbols

Breakpoints map to small alphabet a of symbols

wttT ,,1

i

ij

jnw

i

wn

wn

Tt1)1(


87

27-Apr-17

35

iSAX Representation

• iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity

= { 6, 6, 3, 0} = {110 ,110 ,0111 ,000}

= { 3, 3, 1, 0} = {11 ,11 ,011 ,00 }

= { 1, 1, 0, 0} = {1 ,1 ,0 ,0 }


88

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

iSAX Index


89

27-Apr-17

36




iSAX Index


90

e.g., th=4, w=4, b=1

1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0




iSAX Index


91

1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0

e.g., th=4, w=4, b=1

Insert: 1 1 1 0

27-Apr-17

37




iSAX Index


92

1 1 10 0 1 1 10 0

1 1 11 0 1 1 11 0

e.g., th=4, w=4, b=1

1 1 11 0

1 1 1 0




iSAX Index


93

27-Apr-17

38




iSAX Index


94




iSAX Index


95

27-Apr-17

39

iSAX Index




• Approximate Search ▫ Match iSAX representation at each level

• Exact Search ▫ Leverage approximate search

▫ Prune search space

Lower bounding distance


96

Background

iSAX Index

97

ROOT

. . . 0 0 0 0

1 1 1 0 1 1 1 1

1 1 1 0 1 1 1 0

1 1 1 0

1 1 1 1

1 1 1 1

0 1

0 1

0

Approximate Search

Matches iSAX representation

at each level

Exact Search

Uses a lower bounding function


27-Apr-17

40

iSAX Index: Shortcomings & Challenges

• this is a wonderful index!


98



• … but why does it take soooo long to build for huge datasets?


99

27-Apr-17

41




• because iSAX implements

▫ a naive node splitting policy

may lead to ineffective splits and additional disk I/O


100




• because iSAX implements

▫ a naive node splitting policy

may lead to ineffective splits and additional disk I/O

▫ no bulk loading strategy

does not use available main memory to reduce disk I/O


101

27-Apr-17

42

iSAX 2.0

Bulk Loading Algorithm

• design principles:

▫ take advantage of available main memory

▫ maximize sequential disk accesses


102

iSAX 2.0


• intuition for proposed solution:

▫ for each leaf node, collect as many data series that belong to it as possible before materializing the leaf node

the raw values of data series in leaf nodes are written to disk


103

27-Apr-17

43

iSAX 2.0


• iterate between two phases (till all data series are indexed):

▫ Phase 1

read data series and group them according to first-level nodes

use all available main memory

▫ Phase 2

grow index by processing the subtree rooted at each one of the first-level nodes one at-a-time

flush leaf node contents to disk using sequential accesses


104

Publications

ICDM‘10


105

R

L1 L2 L3

main memory

disk

27-Apr-17

44


106

R

FBL

L1 L2 L3

main memory

disk


107

R

FBL

L1 L2 L3

main memory

disk

no limit in the size of FBLs!

27-Apr-17

45


108

R

insert new ds

FBL

L1 L2 L3

main memory

disk


109

R

FBL

L1 L2 L3

main memory

disk

27-Apr-17

46


110

R

L1 L2 I1

FBL

LBL

L4 L3

main memory

disk


111

R

L1 L2 I1

FBL

LBL

L4 L3

main memory

disk

LBLs have same size as leaf nodes!

27-Apr-17

47


112

R

L1 L2 I1

FBL

LBL

L4 L3

main memory

disk


113

R

L1 L2 I1

FBL

LBL

L4 L3

main memory

disk

27-Apr-17

48


114

R

L1 L2 I1

FBL

LBL

L4 L3

main memory

disk

no extra memory needed!


115

FBL

LBL

R

L1 L2 I1

L4 L3

main memory

disk

27-Apr-17

49


116

FBL

LBL

R

L1 L2 I1

L4 L3

main memory

disk

mainly sequential writes!

Experimental Evaluation

Bulk Loading


131

0

50

100

150

200

250

300

350

400

450

500

100M 200M 300M 400M 500M 800M 900M 1B

Tim

e T

o B

uil

d (

ho

ur

s)

N° Data Series Indexed

iSAX-BufferTree

iSAX

iSAX 2.0

• 1 Billion data series indexed in 16 days: 72% less time

• indexing time per data series: 0.001 sec

27-Apr-17

50

iSAX2+

• intuition for proposed solution:

▫ iSAX grows fast at the beginning of bulk loading, its shape stabilizing well before the end of the process

▫ several data series end up in leaf nodes that never need to split

▫ implement lazy splitting:

move raw data to leaf node the first time

if leaf node splits, do not move raw data until the end of index building process


148

Publications

KAIS‘14


Bulk Loading


167

• 1 Billion data series indexed in 10 days: 82% less time than iSAX

• indexing time per data series: 0.8 milliseconds

0

50

100

150

200

250

300

350

400

450

100M 200M 300M 400M 500M 800M 900M 1B

Tim

e t

o B

uild

( h

ou

rs )

Time Series Indexed

iSAX 2.0

iSAX2+

iSAX 2.0 Clustered

27-Apr-17

51

Drawback of iSAX2+

• cannot start answering queries until entire index is built!


173

Adaptive Data Series Index:

ADS+

• novel paradigm for building a data series index

▫ do not build entire index and then answer queries

▫ start answering queries by building the part of the index needed by those queries

• still guarantee correct answers


174

27-Apr-17

52

Adaptive Data Series Index:

ADS+

• intuition for proposed solution ▫ build the iSAX index using the iSAX representations

▫ just like iSAX2+

▫ but start with a large leaf size ▫ minimize initial cost

▫ postpone leaf materialization to query time

▫ only materialize (at query time) leaves needed by queries

▫ parts that are queried more are refined more ▫ use smaller leaf sizes (reduced leaf materialization and query

answering costs) Themis Palpanas - Telecom Paristech, Apr 2017

175

Publications

SIGMOD‘14

VLDBJ‘16

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Start building an index with only the

iSAX representations


27-Apr-17

53

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Read the data-series one by one from the raw file


ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Convert them to iSAX


27-Apr-17

54

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Store only iSAX in memory (64 times smaller) ~1%


ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Discard raw data and keep pointer to raw file


27-Apr-17

55

ROOT

I1

LBL

FBL

Raw data

I2

DISK

RAM

Continue loading data until we run out of memory


ROOT

I1

L3 L4 L1 L2

I2

LBL

FBL

Raw data

DISK

RAM

Expand each sub-tree and move data to LBL


27-Apr-17

56

Raw data

ROOT

I1

L3 L4 L1 L2

I2

LBL

FBL

DISK

RAM

We flush the data to the disk to free up memory


Raw data

PARTIAL

PARTIAL

ROOT

I1

L5 L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL


27-Apr-17

57

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5 L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

Query #1


Raw data

PARTIAL

PARTIAL

ROOT

I1

L5 L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

Query #1

TOO BIG!


27-Apr-17

58

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5 L2

L1

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

Query #1

TOO BIG!


Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

Query #1

PARTIAL

L5 L4

Adaptive split

Create a smaller leaf


27-Apr-17

59

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

Query #1

PARTIAL

L5 L4

Load data in LBL and answer the query


Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAM L4

PARTIAL

FULL

L5 L4

We spill to the disk when we run out of memory


27-Apr-17

60



200

• iSAX 2.0 needs more than 35 hours to answer 100K queries

• ADS+ answers 100K queries in less than 5 hours

7x faster

Comparison to

multi-dimensional indices

1-3 orders of magnitude faster than multi-dimensional indexing methods

measure data-to-query time

(just index 1 billion data-series)

1

10

100

1000

10000

Ind

exin

g ti

me

(M

inu

tes)

6.6x faster

40x faster

130x faster

1000x faster


201

27-Apr-17

61

Demo

demo


202

• http://www.mi.parisdescartes.fr/~themisp/rinse/

Publications

VLDB‘15

Related Work

• significant amount of work in data series indexing

▫ e.g., TS-Tree [Assent et al. ‘08], iSAX [Shieh & Keogh ‘08]

• none of these approaches

▫ considered bulk loading

▫ examined more than 1 Million data series

• several studies for index bulk loading

▫ merge-based assume data is pre-clustered [Choubey et al. ‘99]

▫ buffering-based work only for balanced indices [Arge et al. ‘02] [Van den Bercken & Seeger ‘01] [Soisalon-Soininen & Widmayer ‘03]

• Adaptive indexing/file reorganization for column stores

▫ Database cracking [Idreos et al. ‘07], raw file cracking [Idreos et al. ’11]


203



27-Apr-17

62

Distribution/Parallelization/Cloud?


204


• discussion so far assumed a single core

▫ focus on efficient resource utilization

▫ squeeze the most out of a single core

▫ produce scalable solutions at lowest possible cost

also suitable for analysts with no access to/expertise for clusters


205

27-Apr-17

63


• further scale-up and scale-out possible!

▫ techniques inherently parallelizable

across cores, across machines

▫ more involved solutions required when optimizing for energy

minimize total work


206

0

1 2 … n

compute nodes/threads

First Buffer Layers

Leaf Buffer Layers

subset of collection that contains the answer

parallelizeddata series index

data series collection

Conclusions

• proposed iSAX 2.0, iSAX 2.0 Clustered, iSAX2+, ADS+

▫ indexing for very large data series collections

code and datasets: http://www.mi.parisdescartes.fr/~themisp/isax2plus/

▫ current state of the art

• experimentally validated proposed approach

▫ first published experiments with 1 Billion data series


207



27-Apr-17

64

Conclusions

• proposed iSAX 2.0, iSAX 2.0 Clustered, iSAX2+, ADS+

▫ indexing for very large data series collections

code and datasets: http://www.mi.parisdescartes.fr/~themisp/isax2plus/

▫ current state of the art

• experimentally validated proposed approach

▫ first published experiments with 1 Billion data series

• case studies in diverse domains exhibit usefulness of approach

▫ for the first time enable pain-free analysis of existing, vast collections of data series


208

What Next?

new challenge: index and mine 10 billion data series


209



27-Apr-17

65

What Next?

• infrastructure monitoring

▫ Facebook wants to manage 4B data series, 12M new values/sec


217

What Next?

218


27-Apr-17

66

219

What Next?



What Next?

220

27-Apr-17

67

The Road Ahead

“enable practitioners and non-expert users to easily and efficiently manage and analyze massive data series collections”


221

The Road Ahead

• Big Sequence Management System

▫ general purpose data series management system


222

data sequences

27-Apr-17

68

The Road Ahead

• Big Sequence Management System


223

257


27-Apr-17

69

Current and Past Collaborations

Infrastructure Monitoring

analysis and mining of hardware

and software infrastructure for

health monitoring

with Facebook, EDF, Safran

Human Behavior Patterns

identification of different social

groups, and analysis of their

macro- and micro-patterns of

behavior

with IBM Research, Telecom Italia

Human Brain Activity

analysis of fully-detailed

neurobiological data for explaining

brain functions

with ICM

Themis Palpanas - Telecom Paristech, Apr

2017

258

Green Manufacturing

analysis and optimization of

manufacturing processes for

energy savings

with SAP, Intel, Volvo, Infineon

eCrime

identification of fraudulent activities

related to the telecommunication

industry

with Telecom Italia, Vodafone, Wind

World Sentiments and Opinions

analysis of aggregate sentiment for

different social groups, role of

media in public sentiment

with Qatar Computing Research

Institute, and Hewlett-Packard Labs

Data-Intensive and Knowledge-Oriented systems

riding the big iot data wave · riding the big iot data wave complex analytics for iot data series...

Documents