One-Pass Wavelet Decompositions of Data Streams
TKDE May 2002
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss
Presented by James Chan, CSCI 599 Multidimensional Databases
Fall 2003
Outline of Talk
• Introduction
• Background
• Proposed Algorithm
• Experiments
• End Notes
Streaming Applications
• Telephone Call Duration
• Call Detail Record (CDR)
• IP Traffic Flow
• Bank ATM Transactions
• Mission-Critical Tasks:
– Fraud
– Security
– Performance Monitoring
Data Stream Model
Data Stream Problem
• One Pass – no backtracking
• Unbounded Data – algorithms require small memory usage
• Continuous – need to run in real time
[Diagram: Data Streams → Stream Processing Engine (Synopsis in Memory) → (Approximate) Answer]
Data Stream Strategies
• Many stream algorithms produce approximate answers and have:
– Deterministic bounds: answers are within ±ε
– Probabilistic bounds: answers are within ±ε with high success probability (1 − δ)
Data Stream Strategies
• Windows: New elements expire after time t
• Samples: Approximate entire domain with a sample
• Histograms: Partitioning element domain values into buckets (Equi-depth, V-Opt)
• Wavelets: Haar, Construction and maintenance (difficult for large domain)
• Sketch Techniques: estimate of L2 norm of a signal
Proposed Stream Model
Background: Cash Register vs. Aggregate
• Cash Register: each incoming stream element names a domain point (increment or decrement the range value at that point)
• Aggregate: each incoming stream element carries a range value (update the range value of that domain point)
Note: examples in this paper assume
– each cash register element is a +1 unit
– no duplicate elements in the aggregate model
Background: Cash Register vs. Aggregate
Cash Register (domain) vs. Aggregate (range):
• Ordered – easiest case, e.g. time series
• Unordered – general and challenging, e.g. network volume
• Contiguous – same as unordered aggregate for the cash register model; n/a for the aggregate model
Background: Wavelet Basics
• Wavelet transforms capture trends in a signal
• Typical transform involves log n passes
• Each pass produces n/2 averages and n/2 differences
• The process is repeated on the averages
• Output: one overall average and n − 1 wavelet basis coefficients
Background: Haar Wavelet Notation
• High-pass filter: {1/2, −1/2}
• Low-pass filter: {1/2, 1/2}
• Input: signal a = [a_1, ..., a_n]
• Basis coefficients: d_{j,k} = <a, ψ_{j,k}>
• Coefficients: w = [w_0, w_1, ..., w_{n−1}] = {c_{0,0}} ∪ {d_{j,k}}
• Scaling factor: s_j = √(N / 2^j)
• Psi vectors (un-normalized), e.g. for N = 8:
ψ_{2,0} = [1, 1, −1, −1, 0, 0, 0, 0]
ψ_{3,0} = [1, −1, 0, 0, 0, 0, 0, 0]
ψ_{3,2} = [0, 0, 0, 0, 1, −1, 0, 0]
Background: Haar Wavelet Example
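The repeated averaging/differencing described above can be sketched in a few lines. The signal below is the deck's running example; the (a − b) / 2 sign convention for differences is an assumption.

```python
# Haar wavelet decomposition by repeated pairwise averaging/differencing.
def haar_decompose(signal):
    coeffs = []                      # detail coefficients, finest level last
    level = list(signal)
    while len(level) > 1:
        averages = [(a + b) / 2 for a, b in zip(level[::2], level[1::2])]
        details  = [(a - b) / 2 for a, b in zip(level[::2], level[1::2])]
        coeffs = details + coeffs    # prepend so coarser levels come first
        level = averages             # recurse on the averages
    return level + coeffs            # [overall average, n-1 detail coeffs]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The output matches the error tree in the Storage slide: overall average 2.75, then −1.25, then 0.5 and 0, then the four finest details.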
Background: Small B Representation
• Most signals in nature have small B representation
• Only keep largest B wavelet coefficients to estimate energy of signal
• Additional coefficients do not help reduce squared sum error
Energy: ‖R‖₂²
SSE: ‖a − R‖₂²
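The "additional coefficients do not help" claim is easiest to see with the *normalized* Haar transform (divide by √2 instead of 2), which makes the basis orthonormal: signal energy equals coefficient energy, and dropping a coefficient adds exactly its square to the SSE. A minimal sketch, with the normalization as an assumption:

```python
import math

# Orthonormal Haar transform: energy is preserved (Parseval), so keeping
# only the largest-B coefficients gives SSE equal to the dropped energy.
def haar_orthonormal(signal):
    coeffs, level = [], list(signal)
    while len(level) > 1:
        averages = [(a + b) / math.sqrt(2) for a, b in zip(level[::2], level[1::2])]
        details  = [(a - b) / math.sqrt(2) for a, b in zip(level[::2], level[1::2])]
        coeffs = details + coeffs
        level = averages
    return level + coeffs

a = [2, 2, 0, 2, 3, 5, 4, 4]
w = haar_orthonormal(a)
energy = sum(x * x for x in a)
assert abs(energy - sum(x * x for x in w)) < 1e-9     # Parseval

B = 3                                                 # keep largest-B terms
dropped = sorted(w, key=abs)[:len(w) - B]
sse = sum(x * x for x in dropped)                     # exact SSE of the B-term R
print(energy, sse)
```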
Background: Storage
• Highest B wavelet coefficients
• log N straddling coefficients, one per level of the wavelet tree
[Figure: Haar error tree for the signal [2, 2, 0, 2, 3, 5, 4, 4] – overall average 2.75 at the root; detail coefficients −1.25; then 0.5, 0; then 0, −1, −1, 0; +/− edges give each coefficient's sign contribution]
Background: Bounding Theorems
Theorem 1
• Given O(B + log N) storage (B is the number of retained coefficients), the time to process a new data item is O(B + log N) in the ordered aggregate model
Theorem 2
• Any algorithm that finds the 2nd-largest wavelet coefficient of the signal in the unordered cash register or unordered aggregate model uses at least N/polylog(N) space
• This holds even if:
– You only care about existence, not the coefficient's value
– You only approximate it up to a factor of 2
Proposed Algorithm: Overview
• Avoid keeping anything of size proportional to the domain N in memory
• Estimate wavelet coefficients using sketches which are size log(N)
• Sketch is maintained in memory and is updated as data entries stream in
What’s a Sketch?
• Distortion parameter ε
• Failure probability δ
• Failure threshold η
• Original signal a
• Random vector r of {−1, +1} entries
• Seed s for r
• Atomic sketch: <a, r>, the dot product of a and r
• Sketch: O(log(N/δ)/ε²) atomic sketches
• The same index j is used for an atomic sketch, its seed, and its random vector: the j-th atomic sketch is X_j = <a, r^j>
Updating a Sketch
• Cash Register – for each j, add r_i^j to the j-th atomic sketch when item i arrives
• Aggregate – for each j, add a(i) · r_i^j to the j-th atomic sketch
• Use a generator G that takes an O(log N)-bit seed s_j to compute r_i^j = G(s_j, i)
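The two update rules above can be sketched directly. For simplicity the r^j vectors here are explicit random-sign arrays rather than the paper's seeded pseudorandom generator, and N and J are illustrative choices:

```python
import random

# Atomic sketches X[j] = <a, r^j>, maintained under streaming updates.
N, J = 16, 50
random.seed(1)
r = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(J)]
X = [0.0] * J

def cash_register_update(i, delta=1):
    for j in range(J):             # item i arrives: a[i] += delta
        X[j] += delta * r[j][i]

def aggregate_update(i, value):
    for j in range(J):             # a(i) arrives in one shot: add a(i) * r_i^j
        X[j] += value * r[j][i]

a = [0] * N
for i in [3, 3, 7, 3, 12]:         # a small cash-register stream
    cash_register_update(i)
    a[i] += 1
for j in range(J):                 # each atomic sketch equals <a, r^j>
    assert X[j] == sum(a[i] * r[j][i] for i in range(N))
```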
Reed-Muller Generator
• Pseudorandom generator meeting these requirements:
– Variables are 4-wise independent: the expected value of the product of any 4 distinct r entries is 0
– Requires O(log N) space for the seed
– Performs each computation in polylog(N) time
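The paper derives r from Reed-Muller codes; a common alternative construction with the same flavor (used here purely as an illustration, not the paper's method) is a degree-3 polynomial hash over a prime field, whose seed is four field elements, i.e. O(log N) bits:

```python
import random

# 4-wise independent values via a random degree-3 polynomial mod a prime;
# the final parity-to-sign mapping is a simplification and slightly biased.
P = (1 << 61) - 1                  # a Mersenne prime

def make_seed(rng):
    return [rng.randrange(P) for _ in range(4)]

def r_entry(seed, i):
    c0, c1, c2, c3 = seed
    h = (((c3 * i + c2) * i + c1) * i + c0) % P   # Horner evaluation
    return 1 if h % 2 == 0 else -1                # map to {-1, +1}

seed = make_seed(random.Random(0))
signs = [r_entry(seed, i) for i in range(8)]
assert all(s in (-1, 1) for s in signs)
```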
Estimation of Inner Product
[Figure: the estimate X is the median of O(log(1/δ)) group means, each the mean over O(1/ε²) atomic sketch products]
Boosting Accuracy and Confidence
• Improve accuracy to ε by averaging over more independent copies of (Σᵢ aᵢ rᵢʲ)(Σᵢ bᵢ rᵢʲ) in each average
• Improve confidence by increasing the number of averages over which the median is taken
[Figure: X = median of O(log(1/δ)) means, each over O(1/ε²) copies of (Σᵢ aᵢ rᵢʲ)(Σᵢ bᵢ rᵢʲ)]
Using the sketches
<a, b> ≈ (Σᵢ aᵢ rᵢʲ)(Σᵢ bᵢ rᵢʲ), computed from the atomic sketches <a, rʲ> and <b, rʲ> kept in memory
• We can approximate <a, ψ_{j,k}> to maintain the top B coefficients
• Note: a point query is <a, eᵢ>, where eᵢ is the vector with a 1 at index i and 0s everywhere else
Maintaining Top B Coefficients
• Each item update changes at most log N + 1 coefficients
• Straddling coefficients may need to be approximated so they can be aggregated with already-maintained ones
• Compare the updated coefficients with the current top B and update the top B if necessary
[Figure: an update to aᵢ affects only its straddling coefficients; all others are unaffected]
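The "compare with the top B" step amounts to re-ranking by absolute value after the straddling coefficients change. A minimal sketch, with purely illustrative coefficient values:

```python
import heapq

# Keep the B coefficients that are largest in absolute value.
def top_b(coeffs, B):
    return dict(heapq.nlargest(B, coeffs.items(), key=lambda kv: abs(kv[1])))

coeffs = {(0, 0): 2.75, (1, 0): -1.25, (2, 0): 0.5, (3, 1): -1.0, (3, 2): -1.0}
kept = top_b(coeffs, 3)            # after an update, re-rank and keep B
print(kept)
```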
Algorithm Space and Time
Their algorithm uses polylog(N) space and polylog(N) per-item time to maintain (an approximation of) the top B terms
Experiments
• Data: one week of AT&T call detail records (unordered cash register model)
• Modes– Batch: Query only between intervals– Online: Query anytime
• Direct Point: estimate <eᵢ, a> from the sketch (eᵢ is the zero vector except for a 1 at index i)
• Direct Wavelets: estimate all supporting coefficients and use wavelet reconstruction to compute the point a(i)
• Top B: reconstruct the point from the top B coefficients (maintained by the sketch)
Top B – Day 0
Top B - 1 Week
(Fixed-set: value updates only, no replacement)
Sketch Size on Accuracy
Heavy Hitters
• Points that contribute significantly to the energy of the signal
• Direct point estimates are very accurate for heavy hitters but poor for non-heavy hitters
• Adaptive greedy pursuit: removing the first heavy hitter from the signal improves the accuracy of estimating the next-biggest heavy hitter
• However, each subtraction of a heavy hitter introduces additional error
Processing Heavy Hitters
Adaptive Greedy Pursuit
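The pursuit loop described above can be sketched on exact values for illustration (the paper performs each step with sketch estimates, so every subtraction carries estimation error):

```python
# Adaptive greedy pursuit: find the heaviest point, subtract it from the
# signal, and repeat on the residual.
def greedy_pursuit(signal, k):
    residual = list(signal)
    hitters = []
    for _ in range(k):
        i = max(range(len(residual)), key=lambda j: residual[j] ** 2)
        hitters.append((i, residual[i]))   # heaviest remaining point
        residual[i] = 0                    # remove it from the residual
    return hitters

print(greedy_pursuit([1, 9, 0, 2, 7, 1, 0, 1], k=2))
# [(1, 9), (4, 7)]
```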
End Notes
• First provable guarantees for Haar wavelets over data streams
• Can estimate Haar coefficients cᵢ = <a, ψᵢ>
• Top B is updated in time polylogarithmic in N and B
• This paper is superseded by "Fast, Small-Space Algorithms for Approximate Histogram Maintenance" (STOC 2002), which discusses how to select the top B and find heavy hitters