Efficient Query Filtering for Streaming Time Series

Li Wei Eamonn Keogh Helga Van Herle Agenor Mafra-Neto

Computer Science & Engineering Dept.

University of California – Riverside

Riverside, CA 92521

{wli, eamonn}@cs.ucr.edu

David Geffen School of Medicine

University of California – Los Angeles

Los Angeles, CA 90095

[email protected]

ISCA Technologies

Riverside, CA 92517

[email protected]

ICDM '05

Outline of Talk

• Introduction to time series

• Time series filtering

• Wedge-based approach

• Experimental results

• Conclusions

What are Time Series?

[Figure: plot of an example time series; x-axis from 0 to 200, y-axis from 4.5 to 5.5]

Time series are collections of observations made sequentially in time.

4.7275 4.7083 4.6700 4.6600 4.6617 4.6517 4.6500 4.6500 4.6917 4.7533 4.8233 4.8700 4.8783 4.8700 4.8500 4.8433 4.8383 4.8400 4.8433 . . .

Time Series are Everywhere

[Figure: example time series drawn from ECG heartbeat, image, stock, and video data]

Time Series Data Mining Tasks

• Clustering

• Classification

• Query by content

• Rule discovery (example rule in the figure: s = 0.5, c = 0.3)

• Motif discovery

• Anomaly detection

• Visualization

Time Series Filtering

[Figure: a set of candidates, numbered 1 to 12]

Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any of the candidates in C.
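The problem statement can be made concrete with a naive reference implementation. This is only a sketch under stated assumptions: the names `euclidean` and `filter_stream` are hypothetical, all candidates are assumed to share one length, the distance is taken to be Euclidean (the metric used later in the talk), and "within r" is read as inclusive.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_stream(series, candidates, r):
    """Yield (start_index, candidate_index) for each subsequence of `series`
    that lies within distance r of some candidate (assumed equal lengths)."""
    n = len(candidates[0])
    for start in range(len(series) - n + 1):
        window = series[start:start + n]
        for idx, cand in enumerate(candidates):
            if euclidean(window, cand) <= r:
                yield start, idx

# Toy data, invented purely for illustration.
series = [0.0, 1.0, 2.0, 1.0, 0.0, 5.0, 5.0]
candidates = [[1.0, 2.0, 1.0], [5.0, 5.0, 5.0]]
matches = list(filter_stream(series, candidates, r=0.5))
print(matches)
```

Every window is compared in full against every candidate here, which is the cost the rest of the talk tries to avoid.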

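[Figure: the time series, with one subsequence labeled "Matches Q11"]

Filtering vs. Querying

[Figure: filtering panel showing queries 1 to 12 against a time series, labeled "Matches Q11"; querying panel showing a single query (template) against a database of sequences 1 to 10, labeled "Best match"]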

Euclidean Distance Metric

Given two time series $Q = q_1 \dots q_n$ and $C = c_1 \dots c_n$, the Euclidean distance between them is defined as:

$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

[Figure: sequences Q and C plotted over 0 to 100]

Early Abandon

During the computation, if the current sum of the squared differences between each pair of corresponding data points exceeds $r^2$, we can safely stop the calculation.
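A minimal sketch of early abandoning follows. The function name `early_abandon_distance` and the returned step counter are assumptions added for illustration; the counter only makes the saving visible.

```python
import math

def early_abandon_distance(q, c, r):
    """Return (distance, steps) if the Euclidean distance is at most r,
    otherwise (None, steps). `steps` counts the point pairs examined."""
    limit = r * r
    total = 0.0
    steps = 0
    for qi, ci in zip(q, c):
        steps += 1
        total += (qi - ci) ** 2
        if total > limit:
            # The running sum already exceeds r^2, so the final
            # distance must exceed r: stop here.
            return None, steps
    return math.sqrt(total), steps

q = [0.0, 0.0, 0.0, 0.0]
far = [0.0, 3.0, 0.0, 0.0]    # abandoned after the second point
near = [0.0, 1.0, 0.0, 1.0]   # computed in full
print(early_abandon_distance(q, far, 2.0))
print(early_abandon_distance(q, near, 2.0))
```

Comparing squared sums against `r * r` avoids taking a square root until a match is confirmed.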

[Figure: sequences Q and C plotted over 0 to 100, with a marker labeled "calculation abandoned at this point"]

Classic Approach

[Figure: a set of candidates, numbered 1 to 12]

Individually compare each candidate sequence to the query using the early abandoning algorithm.

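[Figure: the time series]

Wedge

[Figure: candidates C1 and C2 enclosed by an upper sequence U and a lower sequence L, forming wedge W; a query Q shown against W]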

Given candidate sequences $C_1, \dots, C_k$, we can form two new sequences U and L:

$$U_i = \max(C_{1i}, \dots, C_{ki}), \qquad L_i = \min(C_{1i}, \dots, C_{ki})$$

They form the smallest possible bounding envelope that encloses sequences C1, .. ,Ck .

We call the combination of U and L a wedge, and denote a wedge as W. W = {U, L}

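A lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:

$$LB\_Keogh(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$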
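The wedge construction and the lower bound can be sketched in a few lines. The helper names `build_wedge` and `lb_keogh` are hypothetical, and the toy sequences are invented; the code follows the definitions of U, L, and LB_Keogh given above.

```python
import math

def build_wedge(candidates):
    """Return (U, L): the pointwise max and min over equal-length candidates."""
    upper = [max(col) for col in zip(*candidates)]
    lower = [min(col) for col in zip(*candidates)]
    return upper, lower

def lb_keogh(q, wedge):
    """Lower bound on the Euclidean distance from q to any sequence in the wedge."""
    upper, lower = wedge
    total = 0.0
    for qi, ui, li in zip(q, upper, lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        # Points inside the envelope contribute nothing.
    return math.sqrt(total)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

c1 = [1.0, 2.0, 3.0]
c2 = [2.0, 1.0, 3.0]
wedge = build_wedge([c1, c2])
q = [0.0, 1.5, 5.0]
lb = lb_keogh(q, wedge)
print(wedge, lb, euclidean(q, c1), euclidean(q, c2))
```

Because the bound never exceeds the true distance to any enclosed candidate, a bound greater than r rules out every candidate in the wedge at once.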

Generalized Wedge

• Use W(1,2) to denote that a wedge is built from sequences C1 and C2.

• Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3.

[Figure: C1 (or W1), C2 (or W2), and C3 (or W3); C1 and C2 merge into W(1, 2), which merges with C3 into W((1, 2), 3)]

Wedge Based Approach

[Figure: a set of candidates, numbered 1 to 12]

• Compare the query to the wedge using LB_Keogh

• If the LB_Keogh function early abandons, we are done

• Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm
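The steps above can be sketched with a small recursive search over nested wedges. This is an illustration under assumptions, not the authors' implementation: the `Wedge` class and the `leaf`, `merge`, `lb_exceeds`, and `find_matches` names are hypothetical, and the recursion into child wedges follows the nested-wedge idea from the Generalized Wedge slide.

```python
from dataclasses import dataclass, field

@dataclass
class Wedge:
    upper: list
    lower: list
    label: object = None                      # set only for single candidates
    children: list = field(default_factory=list)

def leaf(label, seq):
    """A single candidate viewed as a degenerate wedge (U = L = candidate)."""
    return Wedge(list(seq), list(seq), label=label)

def merge(a, b):
    """Build the wedge enclosing wedges a and b."""
    upper = [max(x, y) for x, y in zip(a.upper, b.upper)]
    lower = [min(x, y) for x, y in zip(a.lower, b.lower)]
    return Wedge(upper, lower, children=[a, b])

def lb_exceeds(q, w, r):
    """True if LB_Keogh(q, w) > r; abandons once the sum passes r^2."""
    limit, total = r * r, 0.0
    for qi, ui, li in zip(q, w.upper, w.lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > limit:
            return True
    return False

def find_matches(q, w, r):
    """Labels of all candidates under wedge w within distance r of q."""
    if lb_exceeds(q, w, r):
        return []                 # the whole wedge is pruned
    if not w.children:
        # For a leaf, LB_Keogh equals the Euclidean distance,
        # so passing the test above means a true match.
        return [w.label]
    found = []
    for child in w.children:
        found.extend(find_matches(q, child, r))
    return found

root = merge(merge(leaf(1, [1.0, 2.0, 3.0]), leaf(2, [2.0, 1.0, 3.0])),
             leaf(3, [10.0, 10.0, 10.0]))
print(find_matches([1.0, 2.0, 3.1], root, 0.5))
print(find_matches([20.0, 20.0, 20.0], root, 0.5))
```

In the second call the root wedge is pruned on the first data point, so no individual candidate is examined.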

[Figure: the time series]

Examples of Wedge Merging

[Figure: query Q shown against W(1,2) and against W((1,2),3); C1 (or W1) and C2 (or W2) merge into W(1, 2), which merges with C3 (or W3) into W((1, 2), 3)]

Hierarchical Clustering

[Figure: hierarchical clustering of C1 (or W1), C4 (or W4), C2 (or W2), C5 (or W5), and C3 (or W3)]

K = 5: W3, W2, W5, W1, W4

K = 4: W3, W(2,5), W1, W4

K = 3: W3, W(2,5), W(1,4)

K = 2: W((2,5),3), W(1,4)

K = 1: W(((2,5),3), (1,4))

Which wedge set to choose?

Which Wedge Set to Choose?

• Test all k wedge sets on a representative sample of data

• Choose the wedge set which performs the best
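The selection procedure can be sketched as an empirical cost comparison. Everything here is a hedged illustration: wedges are plain `(upper, lower, children)` tuples, the cost model counts one step per data point examined, the wedge sets are assumed to be already built (for example by the hierarchical clustering shown earlier), and all names and toy data are hypothetical.

```python
def leaf(seq):
    """A single candidate as a degenerate wedge."""
    return (list(seq), list(seq), [])

def merge(a, b):
    """The wedge enclosing wedges a and b."""
    upper = [max(x, y) for x, y in zip(a[0], b[0])]
    lower = [min(x, y) for x, y in zip(a[1], b[1])]
    return (upper, lower, [a, b])

def steps_for(q, wedge, r):
    """Data points examined when testing q against a wedge,
    descending into its children when the wedge cannot be pruned."""
    upper, lower, children = wedge
    limit, total, steps = r * r, 0.0, 0
    for qi, ui, li in zip(q, upper, lower):
        steps += 1
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > limit:
            return steps          # early abandon: wedge pruned
    return steps + sum(steps_for(q, child, r) for child in children)

def cost(wedge_set, sample, n, r):
    """Total steps to filter every length-n window of the sample."""
    return sum(steps_for(sample[s:s + n], w, r)
               for s in range(len(sample) - n + 1)
               for w in wedge_set)

def choose_wedge_set(wedge_sets, sample, n, r):
    """Pick the wedge set with the lowest measured cost on the sample."""
    return min(wedge_sets, key=lambda ws: cost(ws, sample, n, r))

w1, w2, w3 = leaf([0.0, 0.0]), leaf([0.0, 1.0]), leaf([9.0, 9.0])
wedge_sets = [
    [w1, w2, w3],                    # K = 3
    [merge(w1, w2), w3],             # K = 2
    [merge(merge(w1, w2), w3)],      # K = 1
]
sample = [5.0, 5.0, 5.0, 5.0]
costs = [cost(ws, sample, 2, 1.0) for ws in wedge_sets]
best = choose_wedge_set(wedge_sets, sample, 2, 1.0)
print(costs)
```

On this toy sample the single fat wedge (K = 1) is the most expensive, because it encloses the query and prunes nothing, while the intermediate set is cheapest.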

Upper Bound on Wedge Based Approach

• The wedge based approach seems to be efficient when comparing a set of time series to a large batch dataset.

• But what about streaming time series?
  – Streaming algorithms are limited by their worst case.
  – Being efficient on average does not help.

• Worst case

[Figure: a subsequence compared against W((1, 2), 3), W(1, 2), and C1 (or W1), C2 (or W2), C3 (or W3); labeled "fails"]

Triangular Inequality

If dist(W((2,5),3), W(1,4)) >= 2r, the subsequence cannot fail on both wedges.

[Figure: the wedge sets for K = 5 through K = 1, with a subsequence at distance < r from one of the wedges W((2,5),3) and W(1,4), which are themselves >= 2r apart]
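The triangular inequality argument can be spelled out as a short derivation. This is a sketch under assumptions: $A$ and $B$ are hypothetical names for sequences taken from the two wedges, $S$ is the incoming subsequence, and $d$ is assumed to satisfy the triangle inequality with $d(A, B) \ge 2r$.

$$
d(A, B) \le d(A, S) + d(S, B)
\quad\Longrightarrow\quad
d(S, B) \ge d(A, B) - d(S, A) > 2r - r = r
\qquad \text{whenever } d(S, A) < r .
$$

So a subsequence that lies within $r$ of a sequence in one wedge is farther than $r$ from every sequence in the other, which is one reading of the statement that the subsequence cannot fail on both wedges.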

Experimental Setup

• Datasets
  – ECG Dataset
  – Stock Dataset
  – Audio Dataset

• We measure the number of computational steps used by the following methods:
  – Brute force
  – Brute force with early abandoning (classic)
  – Our approach (Atomic Wedgie)
  – Our approach with random wedge set (AWR)

ECG Dataset

• Batch time series
  – 650,000 data points (half an hour's ECG signals)

• Candidate set
  – 200 time series of length 40
  – 4 types of patterns
    • left bundle branch block beat
    • right bundle branch block beat
    • atrial premature beat
    • ventricular escape beat

• r = 0.5

• Upper Bound: 2,120 (8,000 for brute force)

Algorithm        Number of Steps
brute force      5,199,688,000
classic          210,190,006
Atomic Wedgie    8,853,008
AWR              29,480,264
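The brute-force figures above can be cross-checked with a little arithmetic. The formula below is an assumption (one window per start position, every candidate compared in full), not something stated on the slide, and the function name is hypothetical; it does reproduce the reported ECG numbers.

```python
def brute_force_steps(num_points, num_candidates, length):
    """Steps for brute force, assuming one sliding window per start position
    and a full comparison against every candidate."""
    windows = num_points - length + 1
    return windows * num_candidates * length

per_window = 200 * 40                          # brute-force cost per window
ecg_steps = brute_force_steps(650_000, 200, 40)

# Ratios computed from the step counts reported in the table.
speedup_classic = 5_199_688_000 / 210_190_006
speedup_wedgie = 5_199_688_000 / 8_853_008
print(per_window, ecg_steps)
print(round(speedup_classic, 1), round(speedup_wedgie, 1))
```

The per-window cost of 8,000 matches the brute-force upper bound quoted on the slide, against 2,120 for the wedge based approach.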

[Figure: bar chart of Number of Steps (x 10^9, axis from 0 to 6) by Algorithm: brute force, classic, Atomic Wedgie, AWR]

Stock Dataset

• Batch time series
  – 2,119,415 data points

• Candidate set
  – 337 time series with length 128
  – 3 types of patterns
    • head and shoulders
    • reverse head and shoulders
    • cup and handle

• r = 4.3

• Upper Bound: 18,048 (43,136 for brute force)

Algorithm        Number of Steps
brute force      91,417,607,168
classic          13,028,000,000
Atomic Wedgie    3,204,100,000
AWR              10,064,000,000

[Figure: bar chart of Number of Steps (x 10^10, axis from 0 to 10) by Algorithm: brute force, classic, Atomic Wedgie, AWR]

Audio Dataset

• Batch time series
  – 37,583,512 data points (one hour's sound)

• Candidate set
  – 68 time series with length 51
  – 3 species of harmful mosquitoes
    • Culex quinquefasciatus
    • Aedes aegypti
    • Culiseta spp

• Sliding window: 11,025 (1 second)

• Step: 5,512 (0.5 second)

• r = 2

• Upper Bound: 2,929 (6,868 for brute force)

Algorithm        Number of Steps
brute force      57,485,160
classic          1,844,997
Atomic Wedgie    1,144,778
AWR              2,655,816

[Figure: bar chart of Number of Steps (x 10^7, axis from 0 to 6) by Algorithm: brute force, classic, Atomic Wedgie, AWR]

Conclusions

• We introduce the problem of time series filtering.

• Combining similar sequences into a wedge is a quite promising idea.

• We have provided an upper bound on the cost of the algorithm, from which we can compute the fastest arrival rate we can guarantee to handle.

Questions?

All datasets used in this talk can be found at

http://www.cs.ucr.edu/~wli/ICDM05/

http://www.brickshelf.com/gallery/Mask-Of-Light/More/homer.gif
