One-Pass Wavelet Decompositions of Data Streams

TKDE May 2002

Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss

Presented by James Chan

CSCI 599 Multidimensional Databases

Fall 2003

Outline of Talk

• Introduction

• Background

• Proposed Algorithm

• Experiments

• End Notes

Streaming Applications

• Telephone Call Duration

• Call Detail Record (CDR)

• IP Traffic Flow

• Bank ATM Transactions

• Mission Critical Tasks:

– Fraud

– Security

– Performance Monitoring

Data Stream Model

Data Stream Problem

• One Pass – no backtracking

• Unbounded Data – algorithms must use little memory

• Continuous – must run in real time

[Figure: data streams flow into a stream processing engine, which keeps a synopsis in memory and returns an (approximate) answer]

Data Stream Strategies

• Many stream algorithms produce approximate answers and have:

– Deterministic bounds: answers are within ±ε

– Probabilistic bounds: answers are within ±ε with high success probability (1-δ)

Data Stream Strategies

• Windows: New elements expire after time t

• Samples: Approximate entire domain with a sample

• Histograms: Partitioning element domain values into buckets (Equi-depth, V-Opt)

• Wavelets: Haar, Construction and maintenance (difficult for large domain)

• Sketch Techniques: estimate of L2 norm of a signal

Proposed Stream Model

Background: Cash Register vs. Aggregate

• Cash Register: each incoming element names a domain value i and increments (or decrements) the range value a(i) at that point

• Aggregate: each incoming element carries the range value a(i) for a domain value i directly

Note: Examples in this paper assume

– each cash register element counts as +1 unit

– no duplicate elements in aggregate models
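The two models can be contrasted with a minimal Python sketch. The function names and the tiny four-value domain are illustrative assumptions, not taken from the paper. Both streams below build the same signal a.

```python
def signal_from_cash_register(stream, n):
    """Cash register: each arriving element is a domain index i, worth +1 unit of a[i]."""
    a = [0] * n
    for i in stream:
        a[i] += 1
    return a


def signal_from_aggregate(stream, n):
    """Aggregate: each arriving element is a pair (i, a[i]); no index repeats."""
    a = [0] * n
    for i, value in stream:
        a[i] = value
    return a


# Hypothetical data over a domain of size 4.
cash_register_stream = [2, 0, 2, 3, 2, 0]             # unordered, one unit per arrival
aggregate_stream = [(0, 2), (1, 0), (2, 3), (3, 1)]   # ordered, one value per index

a_cash = signal_from_cash_register(cash_register_stream, 4)
a_agg = signal_from_aggregate(aggregate_stream, 4)
```

The cash register stream needs many arrivals to build one a(i), which is why a one-pass algorithm cannot finalize any coefficient until the stream ends.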

Background: Cash Register vs. Aggregate

Models compared: Cash Register (domain) and Aggregate (range)

• Ordered: easiest, e.g. time series

• Unordered: general and challenging, e.g. network volume

• Contiguous: same as aggregate unordered; n/a

Background: Wavelet Basics

• Wavelet transforms capture trends in a signal

• Typical transform involves log n passes

• Each pass over m values creates m/2 averages and m/2 differences (n/2 of each on the first pass)

• Process repeated on the averages

• Output: one overall average and n-1 detail coefficients, one per wavelet basis vector
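The passes described above can be sketched in Python as follows. This is a minimal un-normalized version: the function name and the sign convention (left value minus right value) are assumptions made for illustration.

```python
def haar_decompose(signal):
    """Un-normalized Haar transform by repeated pairwise averaging and differencing.

    Returns (overall_average, details), where details[0] is the coarsest level
    and details[-1] the finest. The signal length must be a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Differences of this pass become the next-coarser detail level.
        details.insert(0, [(left - right) / 2 for left, right in pairs])
        # The averages are what the next pass works on.
        averages = [(left + right) / 2 for left, right in pairs]
    return averages[0], details


average, details = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
```

For an input of length 8 this runs three passes and yields one average plus 1 + 2 + 4 = 7 detail coefficients.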

Background: Haar Wavelet Notation

• High pass filter: $\{1/2, -1/2\}$

• Low pass filter: $\{1/2, 1/2\}$

• Input: signal $a = [a_1, a_2, \ldots, a_n]$

• Basis Coefficients: $d_{j,k} = s_j \langle a, \psi_{j,k} \rangle$

• Coefficients: $[w_0, w_1, \ldots, w_{n-1}] = \{c_{0,0}\} \cup \{d_{j,k}\}$

• Scaling Factor: $s_j = N / 2^j$

• Psi Vectors (un-normalized) $\psi_{j,k}$, e.g. $[0,0,0,0,1,1,-1,-1]$, $[0,0,0,0,1,-1,0,0]$, $[0,0,0,0,0,0,1,-1]$

Background: Haar Wavelet Example

Background: Small B Representation

• Most signals in nature have small B representation

• Only keep largest B wavelet coefficients to estimate energy of signal

• Additional coefficients do little to reduce the sum squared error (SSE)

Energy: $\|R\|_2^2$

SSE: $\|a - R\|_2^2$, where $R$ is the B-term representation of the signal $a$
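A small Python sketch can make the energy and SSE of a B-term representation concrete. It assumes an orthonormal Haar transform (filters scaled by 1/sqrt(2)), so that the energy kept plus the SSE equals the energy of the signal. The function names are hypothetical.

```python
import math


def haar_orthonormal(signal):
    """Orthonormal Haar transform: [scaling, coarsest detail, ..., finest details]."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        coeffs = [(left - right) / math.sqrt(2) for left, right in pairs] + coeffs
        approx = [(left + right) / math.sqrt(2) for left, right in pairs]
    return approx + coeffs


def inverse_haar_orthonormal(coeffs):
    """Invert haar_orthonormal level by level, coarsest first."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = []
        for s, d in zip(approx, detail):
            finer += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        approx = finer
    return approx


def top_b_representation(signal, b):
    """Keep the b largest-magnitude coefficients; return (R, kept coefficients)."""
    coeffs = haar_orthonormal(signal)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:b])
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return inverse_haar_orthonormal(kept), kept


a = [2, 2, 0, 2, 3, 5, 4, 4]
R, kept = top_b_representation(a, 3)
energy_of_R = sum(c * c for c in kept)            # ||R||^2, computed from kept coefficients
sse = sum((x - y) ** 2 for x, y in zip(a, R))     # ||a - R||^2
```

Because the transform is orthonormal, each dropped coefficient contributes exactly its square to the SSE, which is why keeping the largest B coefficients minimizes the error.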

Background: Storage

• Highest B wavelet coefficients

• Log N Straddling coefficients, one per level of the wavelet tree

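[Figure: Haar wavelet tree for the original signal [2, 2, 0, 2, 3, 5, 4, 4]. Root average 2.75; detail coefficients -1.25 at the top level, then 0.5 and 0, then 0, -1, -1, 0 at the finest level. Tree edges are marked + and -.]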

Background: Bounding Theorems

Theorem 1

• In the ordered aggregate model, O(B+logN) storage suffices (B is the number of coefficients kept)

• The time to process each new data item is O(B+logN)

Theorem 2

• Any algorithm that calculates the 2nd largest wavelet coefficient of the signal in the unordered cash register or unordered aggregate model uses at least N/polylog(N) space

• This holds even if:

– You only care about the coefficient's existence, not its value

– You only calculate it up to a factor of 2
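The ordered aggregate case of Theorem 1 can be illustrated with a one-pass Python sketch. It is not the paper's exact procedure: the function name, the orthonormal scaling, and the heap for the top B coefficients are assumptions. It keeps one pending value per tree level (the straddling state, about log N values) plus the B largest detail coefficients.

```python
import heapq
import math


def streaming_top_b_haar(stream, b):
    """One pass over an ordered aggregate stream whose length is a power of two.

    Returns (scaling_coefficient, [(level, k, coefficient), ...]) with the
    b >= 1 largest detail coefficients sorted by decreasing magnitude.
    Level 1 is the finest level (support 2); k is the position within a level.
    """
    pending = []  # pending[level]: scaling value waiting for its right sibling
    top = []      # min-heap of (|coefficient|, level, k, coefficient)

    def offer(level, k, coeff):
        item = (abs(coeff), level, k, coeff)
        if len(top) < b:
            heapq.heappush(top, item)
        elif item > top[0]:
            heapq.heapreplace(top, item)

    for index, value in enumerate(stream):
        carry, level, k = float(value), 0, index
        # Merge upward while a left sibling is waiting, like a binary counter.
        while level < len(pending) and pending[level] is not None:
            left, pending[level] = pending[level], None
            k //= 2
            offer(level + 1, k, (left - carry) / math.sqrt(2))
            carry = (left + carry) / math.sqrt(2)
            level += 1
        if level == len(pending):
            pending.append(None)
        pending[level] = carry

    details = [(lvl, k, c) for _, lvl, k, c in sorted(top, reverse=True)]
    return pending[-1], details


scaling, details = streaming_top_b_haar([2, 2, 0, 2, 3, 5, 4, 4], b=2)
```

This only works because values arrive in index order, so a coefficient is final as soon as its right half has been seen. In the unordered models no coefficient is ever final, which is where the sketches come in.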

Proposed Algorithm: Overview

• Avoid keeping anything of size proportional to the domain size N in memory

• Estimate wavelet coefficients using sketches which are size log(N)

• Sketch is maintained in memory and is updated as data entries stream in

What’s a Sketch?

• Distortion Parameter ε (epsilon)

• Failure Probability δ (delta)

• Failure Threshold η (eta)

• Original Signal a

• Random vector r of {-1,+1}s

• Seed s for r

• Atomic Sketch <a,r>: dot product of a and r

• Sketch: O(log(N/δ)/ε^2) atomic sketches

• We use the same index j for an atomic sketch, its seed, and its random vector; j ranges over the atomic sketches in a sketch

Updating a Sketch

• Cash Register – when item i arrives, add $r_i^j$ to the j-th atomic sketch, for every j

• Aggregate – when value a(i) arrives, add $a(i)\, r_i^j$ to the j-th atomic sketch, for every j

Use a generator G that takes the seed $s^j$, which has size log(N), to compute $r_i^j = G(s^j, i)$
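The two update rules can be sketched in Python as below. One loud assumption: this toy class stores every random vector explicitly, which costs O(N) memory per atomic sketch and so defeats the purpose. It stands in for the seeded generator only to keep the example short. The class and method names are hypothetical.

```python
import random


class Sketch:
    """A toy sketch: num_atomic atomic sketches <a, r^j> over a domain of size n."""

    def __init__(self, n, num_atomic, seed=0):
        rng = random.Random(seed)
        # Assumption: explicit +/-1 vectors instead of a seeded generator G(s^j, i).
        self.r = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(num_atomic)]
        self.atomic = [0] * num_atomic

    def update_cash_register(self, i):
        """Item i arrived (worth +1): add r_i^j to every atomic sketch."""
        for j, r_j in enumerate(self.r):
            self.atomic[j] += r_j[i]

    def update_aggregate(self, i, value):
        """Value a(i) arrived: add a(i) * r_i^j to every atomic sketch."""
        for j, r_j in enumerate(self.r):
            self.atomic[j] += value * r_j[i]


cash = Sketch(n=4, num_atomic=5, seed=1)
for item in [2, 0, 2, 3, 2, 0]:
    cash.update_cash_register(item)

agg = Sketch(n=4, num_atomic=5, seed=1)
for index, value in enumerate([2, 0, 3, 1]):
    agg.update_aggregate(index, value)
```

Both streams describe the signal [2, 0, 3, 1], and because the updates are linear the two sketches end up identical regardless of arrival order.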

Reed Muller Generator

• Pseudo random generator meeting these requirements:

– Variables are 4-wise independent: the expected value of the product of any 4 distinct r is 0

– Requires O(log N) space for seeding

– Performs computation in polylog(N) time

{0} {d} {c} …. {d,c,b,a}
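One way to see the 4-wise independence property is the Python sketch below, which draws each r_i from a random quadratic polynomial over GF(2), i.e. a random second-order Reed-Muller codeword. This is an illustrative construction and may differ from the paper's generator. In particular its seed has 1 + m + m(m-1)/2 bits for N = 2^m, which is O(log^2 N) rather than the O(log N) stated above. The function names are hypothetical.

```python
from itertools import combinations, product


def rm2_sign(seed_bits, i, m):
    """Return r_i in {-1, +1} for index i in [0, 2^m).

    seed_bits holds the coefficients of a quadratic polynomial over GF(2):
    one constant term, m linear terms, and m*(m-1)/2 pairwise terms.
    """
    x = [(i >> t) & 1 for t in range(m)]
    monomials = [1] + x + [x[s] & x[t] for s, t in combinations(range(m), 2)]
    parity = 0
    for coeff, mono in zip(seed_bits, monomials):
        parity ^= coeff & mono
    return -1 if parity else 1


def expected_product(indices, m):
    """Average of the product of r_i over every possible seed (small m only)."""
    num_bits = 1 + m + m * (m - 1) // 2
    total = 0
    count = 0
    for seed_bits in product((0, 1), repeat=num_bits):
        p = 1
        for i in indices:
            p *= rm2_sign(seed_bits, i, m)
        total += p
        count += 1
    return total / count


# Exhaustive check for N = 8: every set of 4 distinct indices has expectation 0.
worst = max(abs(expected_product(quad, 3)) for quad in combinations(range(8), 4))
```

Only the seed needs to be stored; any r_i is recomputed on demand when item i arrives.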

Estimation of Inner Product

• Take the mean of O(1/ε^2) independent copies of the product of atomic sketches

• Repeat to get O(log(1/δ)) such means

• X = median of the means

Boosting Accuracy and Confidence

• Improve accuracy to ε by averaging over more independent copies of $\sum_i a_i r_i^j \cdot \sum_i b_i r_i^j$ for each average

• Improve confidence by increasing the number of averages to take the median over

• Each mean is taken over O(1/ε^2) copies of $\sum_i a_i r_i^j \cdot \sum_i b_i r_i^j$

• X = median of O(log(1/δ)) such means
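The median-of-means estimate can be sketched in Python as follows. As an assumption for brevity it uses fully random ±1 vectors from Python's `random` module rather than a 4-wise independent generator, and it recomputes the atomic sketches from the full vectors instead of maintaining them over a stream. The function name and parameters are hypothetical.

```python
import random
import statistics


def estimate_inner_product(a, b, copies_per_mean, num_means, seed=0):
    """Estimate <a, b> as the median of num_means means of <a, r><b, r>."""
    rng = random.Random(seed)
    n = len(a)
    means = []
    for _ in range(num_means):
        products = []
        for _ in range(copies_per_mean):
            r = [rng.choice((-1, 1)) for _ in range(n)]
            atomic_a = sum(x * s for x, s in zip(a, r))   # atomic sketch <a, r>
            atomic_b = sum(y * s for y, s in zip(b, r))   # atomic sketch <b, r>
            products.append(atomic_a * atomic_b)
        means.append(sum(products) / len(products))       # averaging tightens accuracy
    return statistics.median(means)                       # the median raises confidence


a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
exact = sum(x * y for x, y in zip(a, b))
estimate = estimate_inner_product(a, b, copies_per_mean=400, num_means=9)
```

Each product has expectation <a, b> because the cross terms r_i r_l with i ≠ l average to zero. Averaging shrinks the variance, and the median guards against the occasional bad mean.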

Using the sketches

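$\langle a, b \rangle \approx \sum_i a_i r_i^j \cdot \sum_i b_i r_i^j$

• We can approximate $\langle a, \psi \rangle$ to maintain the top B coefficients

• Note a point query is $\langle a, e_i \rangle$, where $e_i$ is a vector with a 1 at index i and 0s everywhere else

Atomic sketches in memory: $\langle a, r^j \rangle$, $\langle b, r^j \rangle$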

Maintaining Top B Coefficients

• Each new item triggers at most log N + 1 coefficient updates

• May need to approximate straddling coefficients to aggregate with already existing or near variables

• Compare updates with top B and update top B if necessary

[Figure: wavelet tree marking the coefficients that are updated and those that are unaffected when a_i changes]
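The log N + 1 bound can be checked with a short Python sketch that works on exact, un-normalized coefficients (no sketches involved). The level numbering, with level 1 as the finest, and all function names are assumptions made for illustration.

```python
def affected_coefficients(i, n):
    """List (level, k, sign) for the detail coefficients whose support contains i.

    Level L has support 2**L. The sign is +1 when i lies in the left half of
    the support and -1 otherwise. Together with the overall average this
    gives log2(n) + 1 affected coefficients.
    """
    out = []
    level = 1
    while 2 ** level <= n:
        support = 2 ** level
        sign = 1 if (i % support) < support // 2 else -1
        out.append((level, i // support, sign))
        level += 1
    return out


def full_transform(signal):
    """Reference un-normalized Haar coefficients, computed from scratch."""
    n = len(signal)
    coeffs = {"avg": sum(signal) / n}
    level = 1
    while 2 ** level <= n:
        support, half = 2 ** level, 2 ** level // 2
        for k in range(n // support):
            start = k * support
            left = sum(signal[start:start + half])
            right = sum(signal[start + half:start + support])
            coeffs[(level, k)] = (left - right) / support
        level += 1
    return coeffs


def apply_point_update(coeffs, i, delta, n):
    """Add delta to a[i] by touching only the affected coefficients."""
    coeffs["avg"] += delta / n
    for level, k, sign in affected_coefficients(i, n):
        coeffs[(level, k)] += sign * delta / 2 ** level


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = full_transform(signal)
apply_point_update(coeffs, i=5, delta=4, n=8)
```

After the update, `coeffs` matches a full recomputation on the modified signal, although only four of the eight coefficients were touched.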

Algorithm Space and Time

The algorithm uses polylog(N) space and polylog(N) time per item to maintain an approximate B-term representation

Experiments

• Data: one week of AT&T call detail (unordered cash register model)

• Modes

– Batch: query only between intervals

– Online: query anytime

• Direct Point: estimate a(i) from the sketch as <ei,a> (ei is the zero vector except for a 1 at index i)

• Direct Wavelets: estimate all supporting coefficients and use wavelet reconstruction to calculate point a(i)

• Top B: Reconstruction of point is done with Top B (maintained by sketch)

Top B – Day 0

Top B - 1 Week

(fixed-set) Value updates only. no replacement

Sketch Size on Accuracy

Heavy Hitters

• Points that contribute significantly to the energy of the signal

• Direct point estimates are very accurate for heavy hitters but poor for non-heavy hitters

• Adaptive Greedy pursuit: by removing the first heavy hitter from the signal, you improve the accuracy of calculating the next biggest heavy hitter

• However an error is introduced with each subtraction of a heavy hitter
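Adaptive greedy pursuit can be sketched in Python as follows. It uses explicit, fully random ±1 vectors as a simplifying assumption, and all function names and parameter values are hypothetical. Each round estimates every point from the sketch, takes the largest, and subtracts its estimated contribution from the sketch, which is possible because sketches are linear.

```python
import random
import statistics


def make_random_vectors(n, count, seed=0):
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(count)]


def sketch_of(signal, vectors):
    """Atomic sketches <a, r^j> for every random vector r^j."""
    return [sum(x * s for x, s in zip(signal, r)) for r in vectors]


def point_estimate(sketch, vectors, i, group_size):
    """Median of means of <a, r^j> * r_i^j, which estimates a(i) = <a, e_i>."""
    products = [s * r[i] for s, r in zip(sketch, vectors)]
    means = [sum(products[g:g + group_size]) / group_size
             for g in range(0, len(products), group_size)]
    return statistics.median(means)


def greedy_pursuit(sketch, vectors, n, rounds, group_size):
    """Repeatedly pick the largest estimated point and remove it from the sketch."""
    residual = list(sketch)
    found = []
    for _ in range(rounds):
        estimates = [point_estimate(residual, vectors, i, group_size) for i in range(n)]
        best = max(range(n), key=lambda i: abs(estimates[i]))
        found.append((best, estimates[best]))
        # Subtract the sketch of estimate * e_best; any estimation error stays behind.
        residual = [s - estimates[best] * r[best] for s, r in zip(residual, vectors)]
    return found


n = 32
signal = [1] * n
signal[7], signal[20] = 100, 60          # two hypothetical heavy hitters
vectors = make_random_vectors(n, 7 * 300, seed=1)
found = greedy_pursuit(sketch_of(signal, vectors), vectors, n, rounds=2, group_size=300)
```

The variance of a point estimate is driven by the energy of everything else in the signal, so removing the first heavy hitter makes the second one much easier to estimate. The subtraction uses an estimate rather than the true value, so a small error is carried into later rounds.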

Processing Heavy Hitters

Adaptive Greedy Pursuit

End Notes

• First provable guarantees for Haar wavelets over data streams

• Can estimate Haar coefficients $c_i = \langle a, \psi_i \rangle$

• This paper is superseded by "Fast, Small-space algorithms for approximate histogram maintenance", STOC 2002

– Discusses how to select top B and find heavy hitters

• Top B is updated in:

))()log((log 3 BNNO
