Efficient Query Filtering for Streaming Time Series



Page 1: Efficient Query Filtering for Streaming Time Series

Efficient Query Filtering for Streaming Time Series

Li Wei Eamonn Keogh Helga Van Herle Agenor Mafra-Neto

Computer Science & Engineering Dept.

University of California – Riverside

Riverside, CA 92521

{wli, eamonn}@cs.ucr.edu

David Geffen School of Medicine

University of California – Los Angeles

Los Angeles, CA 90095

[email protected]

ISCA Technologies

Riverside, CA 92517

[email protected]

ICDM '05

Page 2: Efficient Query Filtering for Streaming Time Series

Outline of Talk

• Introduction to time series

• Time series filtering

• Wedge-based approach

• Experimental results

• Conclusions

Page 3: Efficient Query Filtering for Streaming Time Series

What are Time Series?

[Figure: an example time series; x-axis 0 to 200, y-axis 4.5 to 5.5]

Time series are collections of observations made sequentially in time.

4.7275 4.7083 4.6700 4.6600 4.6617 4.6517 4.6500 4.6500 4.6917 4.7533 4.8233 4.8700 4.8783 4.8700 4.8500 4.8433 4.8383 4.8400 4.8433 . . .

Page 4: Efficient Query Filtering for Streaming Time Series

Time Series are Everywhere

[Figure: example time series drawn from ECG heartbeats, images, stock prices, and video]

Page 5: Efficient Query Filtering for Streaming Time Series

Time Series Data Mining Tasks

• Clustering
• Classification
• Query by Content
• Rule Discovery (figure annotation: s = 0.5, c = 0.3)
• Motif Discovery
• Anomaly Detection
• Visualization

Page 6: Efficient Query Filtering for Streaming Time Series

Time Series Filtering

Given a Time Series T, a set of Candidates C and a distance threshold r, find all subsequences in T that are within r distance to any of the candidates in C.

[Figure: a set of 12 candidates and a time series; one subsequence of the time series matches Q11]
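The problem statement on this slide can be sketched directly as a brute-force scan. This is only an illustration of the definition, not the method the talk proposes; the function names and the toy data are assumptions, and "within r" is read here as distance <= r.

```python
import math

def euclidean(q, c):
    # Plain Euclidean distance between two equal-length sequences.
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))

def filter_brute_force(T, candidates, r):
    """Return (start, candidate_index) pairs for every subsequence of T
    that lies within distance r of a candidate (hypothetical helper)."""
    n = len(candidates[0])            # all candidates share one length
    matches = []
    for start in range(len(T) - n + 1):
        window = T[start:start + n]
        for j, c in enumerate(candidates):
            if euclidean(window, c) <= r:
                matches.append((start, j))
    return matches

# Toy data (made up for illustration).
T = [0.0, 1.0, 2.0, 1.0, 0.0, 5.0]
candidates = [[1.0, 2.0, 1.0], [5.0, 5.0, 5.0]]
matches = filter_brute_force(T, candidates, r=0.1)
print(matches)
```

Every subsequence is compared in full against every candidate, which is the cost the later slides try to reduce.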

Page 7: Efficient Query Filtering for Streaming Time Series

Filtering vs. Querying

[Figure: two panels. Filtering: a set of 12 queries is compared against a database; one subsequence matches Q11. Querying: a single query (template) is compared against 10 database items to find the best match.]

Page 8: Efficient Query Filtering for Streaming Time Series

Euclidean Distance Metric

Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance between them is defined as:

$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

[Figure: time series Q and C plotted over 0 to 100]

Page 9: Efficient Query Filtering for Streaming Time Series

Early Abandon

During the computation, if the current sum of the squared differences between each pair of corresponding data points exceeds r², we can safely stop the calculation.

[Figure: Q and C plotted over 0 to 100, with the point at which the calculation is abandoned marked]
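A minimal sketch of early abandoning, assuming the two sequences have equal length. The function name and the returned step count are illustrative additions, included so that the saving can be observed.

```python
import math

def early_abandon_distance(q, c, r):
    """Return (distance, steps). The distance is math.inf when the running
    sum of squared differences exceeded r**2 before the last point."""
    limit = r * r
    total = 0.0
    for steps, (qi, ci) in enumerate(zip(q, c), start=1):
        total += (qi - ci) ** 2
        if total > limit:
            return math.inf, steps    # abandoned: the true distance is > r
    return math.sqrt(total), len(q)

q = [0.0, 0.0, 0.0, 0.0]
far = early_abandon_distance(q, [0.0, 3.0, 0.0, 0.0], r=2.0)   # stops at the 2nd point
near = early_abandon_distance(q, [0.0, 1.0, 0.0, 1.0], r=2.0)  # runs to the end
print(far, near)
```

Comparing squared sums against r² avoids taking a square root until a full distance is actually needed.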

Page 10: Efficient Query Filtering for Streaming Time Series

Classic Approach

Individually compare each candidate sequence to the query using the early abandoning algorithm.

[Figure: a set of 12 candidates, each compared in turn against the time series]

Page 11: Efficient Query Filtering for Streaming Time Series

Wedge

Having candidate sequences C1, .. , Ck, we can form two new sequences U and L:

Ui = max(C1i, .. , Cki)
Li = min(C1i, .. , Cki)

They form the smallest possible bounding envelope that encloses sequences C1, .. , Ck.

We call the combination of U and L a wedge, and denote a wedge as W. W = {U, L}

A lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:

$$\mathrm{LB\_Keogh}(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$

[Figure: candidates C1 and C2, the wedge W formed by their envelope U and L, and a query Q compared against W]
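The wedge construction and the LB_Keogh lower bound can be sketched as follows. The function names are illustrative, and the optional early-abandoning threshold `r` is an assumption added for convenience.

```python
import math

def build_wedge(candidates):
    # U and L are the pointwise maximum and minimum over all candidates.
    U = [max(col) for col in zip(*candidates)]
    L = [min(col) for col in zip(*candidates)]
    return U, L

def lb_keogh(q, wedge, r=math.inf):
    """Lower bound on the distance from q to every candidate inside the
    wedge. Returns math.inf once the running sum exceeds r**2."""
    U, L = wedge
    limit = r * r
    total = 0.0
    for qi, ui, li in zip(q, U, L):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        # points inside the envelope contribute nothing
        if total > limit:
            return math.inf
    return math.sqrt(total)

wedge = build_wedge([[1, 2, 3], [2, 1, 4]])
lb = lb_keogh([0, 1.5, 6], wedge)
print(wedge, lb)
```

Because only the parts of Q that leave the envelope are counted, the bound can never exceed the true Euclidean distance to any enclosed candidate.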

Page 12: Efficient Query Filtering for Streaming Time Series

Generalized Wedge

• Use W(1,2) to denote that a wedge is built from sequences C1 and C2.

• Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3.

[Figure: C1 (or W1), C2 (or W2) and C3 (or W3) merged into W(1, 2) and then W((1, 2), 3)]
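Nesting can be sketched by treating a single candidate as a degenerate wedge whose upper and lower envelopes coincide, so that merging two wedges is a pointwise max/min. The helper names below are hypothetical.

```python
def leaf_wedge(c):
    # A single candidate Ci seen as the wedge Wi, with U == L == Ci.
    return list(c), list(c)

def merge_wedges(wa, wb):
    # The merged envelope encloses everything either wedge enclosed.
    (ua, la), (ub, lb) = wa, wb
    U = [max(a, b) for a, b in zip(ua, ub)]
    L = [min(a, b) for a, b in zip(la, lb)]
    return U, L

C1, C2, C3 = [1, 2, 3], [2, 1, 4], [0, 5, 3]
W12 = merge_wedges(leaf_wedge(C1), leaf_wedge(C2))   # W(1,2)
W123 = merge_wedges(W12, leaf_wedge(C3))             # W((1,2),3)
print(W12, W123)
```

Merging W(1,2) with C3 gives the same envelope as building a wedge from all three candidates at once; the nesting only records the order in which the wedge can later be split.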

Page 13: Efficient Query Filtering for Streaming Time Series

Wedge Based Approach

• Compare the query to the wedge using LB_Keogh

• If the LB_Keogh function early abandons, we are done

• Otherwise individually compare each candidate sequence to the query using the early abandoning algorithm

[Figure: a set of 12 candidates enclosed in a wedge and compared against the time series]
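The steps above can be sketched as a search over nested wedges. The slide describes one wedge followed by individual comparisons; descending through nested wedges, as done here, is a generalization suggested by the previous slide rather than something this slide states. The `Wedge` class and the function names are illustrative.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Wedge:
    U: list
    L: list
    children: list = field(default_factory=list)  # empty for one candidate
    label: str = ""

def leaf(c, label):
    return Wedge(list(c), list(c), [], label)

def merge(a, b):
    U = [max(x, y) for x, y in zip(a.U, b.U)]
    L = [min(x, y) for x, y in zip(a.L, b.L)]
    return Wedge(U, L, [a, b], f"({a.label},{b.label})")

def lb_keogh(q, w, r):
    # Early-abandoning LB_Keogh; math.inf means the wedge was pruned.
    total = 0.0
    for qi, ui, li in zip(q, w.U, w.L):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > r * r:
            return math.inf
    return math.sqrt(total)

def search(q, wedges, r):
    """Return the labels of all candidates within r of q."""
    matches, stack = [], list(wedges)
    while stack:
        w = stack.pop()
        if lb_keogh(q, w, r) == math.inf:
            continue                  # no candidate inside w can match
        if w.children:
            stack.extend(w.children)  # split the wedge and look closer
        else:
            matches.append(w.label)   # on a leaf the bound is the exact distance
    return matches

root = merge(merge(leaf([1, 2, 3], "1"), leaf([1.2, 2.1, 3.1], "2")),
             leaf([9, 9, 9], "3"))
hits = search([1.1, 2.0, 3.0], [root], r=0.5)
misses = search([5, 5, 5], [root], r=0.5)
print(hits, misses)
```

A single pruned wedge dismisses all the candidates it encloses at the cost of one comparison.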

Page 14: Efficient Query Filtering for Streaming Time Series

Examples of Wedge Merging

[Figure: a query Q compared against W(1,2), built from C1 (or W1) and C2 (or W2), and against W((1,2),3), built from W(1, 2) and C3 (or W3)]

Page 15: Efficient Query Filtering for Streaming Time Series

Hierarchical Clustering

[Figure: candidates C1 (or W1) to C5 (or W5) merged step by step into the following wedge sets]

K = 5: W1, W2, W3, W4, W5
K = 4: W1, W3, W4, W(2,5)
K = 3: W3, W(2,5), W(1,4)
K = 2: W((2,5),3), W(1,4)
K = 1: W(((2,5),3), (1,4))

Which wedge set to choose?

Page 16: Efficient Query Filtering for Streaming Time Series

Which Wedge Set to Choose?

• Test all k wedge sets on a representative sample of data

• Choose the wedge set which performs the best
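A sketch of this selection step, under stated assumptions: the k wedge sets are produced by greedily merging the pair of wedges whose merged envelope has the smallest area (a stand-in for whatever clustering criterion the authors used), and "performs the best" is read as "fewest comparison steps on the sample". All names are hypothetical.

```python
def lb_steps(q, w, r):
    """Early-abandoning LB_Keogh; returns (pruned, steps_used)."""
    total = 0.0
    for steps, (qi, ui, li) in enumerate(zip(q, w["U"], w["L"]), start=1):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > r * r:
            return True, steps
    return False, len(q)

def merge(a, b):
    U = [max(x, y) for x, y in zip(a["U"], b["U"])]
    L = [min(x, y) for x, y in zip(a["L"], b["L"])]
    return {"U": U, "L": L, "children": [a, b]}

def area(w):
    return sum(u - l for u, l in zip(w["U"], w["L"]))

def wedge_sets(candidates):
    """Wedge sets for K = k, k-1, ..., 1 (assumed merge rule: smallest area)."""
    current = [{"U": list(c), "L": list(c), "children": []} for c in candidates]
    sets = [list(current)]
    while len(current) > 1:
        pairs = [(i, j) for i in range(len(current))
                 for j in range(i + 1, len(current))]
        i, j = min(pairs, key=lambda p: area(merge(current[p[0]], current[p[1]])))
        merged = merge(current[i], current[j])
        current = [w for x, w in enumerate(current) if x not in (i, j)] + [merged]
        sets.append(list(current))
    return sets

def cost(q, wedges, r):
    # Steps spent on q, splitting every wedge that fails to prune.
    steps, stack = 0, list(wedges)
    while stack:
        w = stack.pop()
        pruned, used = lb_steps(q, w, r)
        steps += used
        if not pruned:
            stack.extend(w["children"])
    return steps

def total_cost(wedges, sample, r):
    return sum(cost(q, wedges, r) for q in sample)

def choose_wedge_set(candidates, sample, r):
    return min(wedge_sets(candidates), key=lambda ws: total_cost(ws, sample, r))

candidates = [[0, 0, 0, 0], [0.1, 0, 0.1, 0], [5, 5, 5, 5], [5, 5.1, 5, 5.1]]
sample = [[10, 10, 10, 10], [0, 0, 0, 0], [-3, -3, -3, -3]]
best = choose_wedge_set(candidates, sample, r=0.5)
print(len(best), total_cost(best, sample, 0.5))
```

The K = k set corresponds to the classic approach, so the chosen set can never cost more on the sample than comparing every candidate individually.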

Page 17: Efficient Query Filtering for Streaming Time Series

Upper Bound on Wedge Based Approach

• Wedge based approach seems to be efficient when comparing a set of time series to a large batch dataset.

• But, what about streaming time series?
  – Streaming algorithms are limited by their worst case.
  – Being efficient on average does not help.

• Worst case

[Figure: a subsequence compared against C1 (or W1), C2 (or W2), C3 (or W3), W(1, 2) and W((1, 2), 3)]
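To see why the worst case matters, the sketch below counts the steps spent when nothing is ever pruned. It assumes that every wedge in the hierarchy, and every candidate beneath it, then costs a full n steps; the nested-tuple encoding of a wedge and the function names are illustrative.

```python
def count_nodes(tree):
    """tree is an int (one candidate) or a tuple of subtrees (a merged wedge)."""
    if isinstance(tree, int):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)

def worst_case_steps(wedge_set, n):
    # Assumption: with no pruning, each node costs all n steps.
    return n * sum(count_nodes(tree) for tree in wedge_set)

n = 40                       # candidate length, chosen for illustration
flat = [1, 2, 3]             # K = 3: three individual candidates
nested = [((1, 2), 3)]       # K = 1: the single wedge W((1,2),3)

flat_steps = worst_case_steps(flat, n)       # 3 nodes
nested_steps = worst_case_steps(nested, n)   # 5 nodes
print(flat_steps, nested_steps)
```

Under this assumption the nested wedge visits five nodes where the individual comparison visits three, so a wedge set that is fast on average can still be slower in the worst case.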

Page 18: Efficient Query Filtering for Streaming Time Series

Triangular Inequality

If dist(W((2,5),3), W(1,4)) >= 2r, a subsequence cannot fail on both wedges.

[Figure: the wedge hierarchy for K = 5 to K = 1 (W1 to W5; W(2,5); W(1,4); W((2,5),3); W(((2,5),3), (1,4))). W((2,5),3) and W(1,4) are >= 2r apart; the subsequence is < r from one of them, and its relation to the other is marked "?".]
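The argument can be written out as a short derivation. It rests on two assumptions the slide does not spell out: that "failing" on a wedge means the lower bound stays below r, so the wedge is not pruned, and that dist(W_a, W_b) is the smallest Euclidean distance between any sequence inside one envelope and any sequence inside the other.

```latex
\begin{align*}
&\text{Suppose } \mathrm{LB\_Keogh}(Q, W_a) < r \text{ and } \mathrm{LB\_Keogh}(Q, W_b) < r. \\
&\text{Let } A \text{ and } B \text{ be the sequences inside the envelopes of } W_a \text{ and } W_b
  \text{ that are nearest to } Q, \\
&\text{so that } D(Q, A) = \mathrm{LB\_Keogh}(Q, W_a) \text{ and } D(Q, B) = \mathrm{LB\_Keogh}(Q, W_b). \\
&\text{Then } \mathrm{dist}(W_a, W_b) \le D(A, B) \le D(A, Q) + D(Q, B) < r + r = 2r .
\end{align*}
```

Taking the contrapositive, when the two wedges are at least 2r apart, at most one of them can have a lower bound below r, so the subsequence is pruned by at least one of the two.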

Page 19: Efficient Query Filtering for Streaming Time Series

Experimental Setup

• Datasets
  – ECG Dataset
  – Stock Dataset
  – Audio Dataset

• We measure the number of computational steps used by the following methods:
  – Brute force
  – Brute force with early abandoning (classic)
  – Our approach (Atomic Wedgie)
  – Our approach with random wedge set (AWR)

Page 20: Efficient Query Filtering for Streaming Time Series

ECG Dataset

• Batch time series
  – 650,000 data points (half an hour’s ECG signals)

• Candidate set
  – 200 time series of length 40
  – 4 types of patterns
    • left bundle branch block beat
    • right bundle branch block beat
    • atrial premature beat
    • ventricular escape beat

• r = 0.5

• Upper Bound: 2,120 (8,000 for brute force)

Algorithm        Number of Steps
brute force      5,199,688,000
classic          210,190,006
Atomic Wedgie    8,853,008
AWR              29,480,264

[Figure: bar chart of the number of steps (×10^9) for brute force, classic, Atomic Wedgie and AWR]

Page 21: Efficient Query Filtering for Streaming Time Series

Stock Dataset

• Batch time series
  – 2,119,415 data points

• Candidate set
  – 337 time series with length 128
  – 3 types of patterns
    • head and shoulders
    • reverse head and shoulders
    • cup and handle

• r = 4.3

• Upper Bound: 18,048 (43,136 for brute force)

Algorithm        Number of Steps
brute force      91,417,607,168
classic          13,028,000,000
Atomic Wedgie    3,204,100,000
AWR              10,064,000,000

[Figure: bar chart of the number of steps (×10^10) for brute force, classic, Atomic Wedgie and AWR]
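The brute-force rows of the ECG and stock tables can be reproduced with simple arithmetic, assuming one step per point comparison and one subsequence per position of the batch series. The function name is illustrative.

```python
def brute_force_steps(m, k, n):
    """Steps needed to compare every length-n subsequence of a length-m
    series against k candidates of length n, with no early abandoning."""
    return (m - n + 1) * k * n

ecg = brute_force_steps(m=650_000, k=200, n=40)
stock = brute_force_steps(m=2_119_415, k=337, n=128)
print(ecg)    # 5199688000
print(stock)  # 91417607168

# Per-subsequence cost k * n, the "for brute force" figure on each slide.
print(200 * 40, 337 * 128)  # 8000 43136
```

Both totals agree with the tables, and k * n agrees with the brute-force upper bounds quoted next to them.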

Page 22: Efficient Query Filtering for Streaming Time Series

Audio Dataset

• Batch time series
  – 37,583,512 data points (one hour’s sound)

• Candidate set
  – 68 time series with length 51
  – 3 species of harmful mosquitoes
    • Culex quinquefasciatus
    • Aedes aegypti
    • Culiseta spp

• Sliding window: 11,025 (1 second)

• Step: 5,512 (0.5 second)

• r = 2

• Upper Bound: 2,929 (6,868 for brute force)

Algorithm        Number of Steps
brute force      57,485,160
classic          1,844,997
Atomic Wedgie    1,144,778
AWR              2,655,816

[Figure: bar chart of the number of steps (×10^7) for brute force, classic, Atomic Wedgie and AWR]

Page 23: Efficient Query Filtering for Streaming Time Series

Conclusions

• We introduce the problem of time series filtering.

• Combining similar sequences into a wedge is a quite promising idea.

• We have provided the upper bound of the cost of the algorithm to compute the fastest arrival rate we can guarantee to handle.

Page 24: Efficient Query Filtering for Streaming Time Series

Questions?

All datasets used in this talk can be found at

http://www.cs.ucr.edu/~wli/ICDM05/