PODS 2003
Maintaining Time-Decaying Stream Aggregates
Edith Cohen, Martin Strauss
AT&T Labs-Research

The Problem
• A data stream is a sequence of data items observed over time.
• Multiple massive data streams are present.
• Storage constraints allow us to maintain only a compact summary of the “essence” of the information in each stream.
• The relevance of information decays with time.
• Thus, when aggregating across time, older information should be discounted.

Applications
• IP routing (RED protocol): a time-decayed average of previous queue lengths is used to estimate impending congestion at a router.
• Internet gateway selection: track the quality (e.g., packet loss rate) of alternative paths in order to select the more reliable one.
• Usage statistics of phone customers: AT&T has about 100M customers.
• More …

Decay Functions
• A decay function g(x)>=0 is a non-increasing function defined for x>=1.
• f(t)>=0 is the value of the data item observed at time t.
• The weight at time T of an item observed at time t is g(T-t).
• The decayed value of that item is f(t)g(T-t).

Time-Decaying Sum
$$V_g(T) = \sum_{t < T} f(t)\, g(T - t)$$
• When the values f(t) are 0/1, we refer to the problem as the time-decaying count.
• Maintaining the decaying sum exactly can, in general, require a number of bits linear in the stream length.
• We consider maintaining it approximately, to within a factor of $1 \pm \epsilon$.

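As a reference point, the definition of the time-decaying sum can be evaluated by brute force. The sketch below is illustrative only: the name `decayed_sum` and the convention that `values[t - 1]` holds f(t) are assumptions, not part of the slides. It keeps the whole stream, which is exactly the linear cost that the approximate summaries are meant to avoid.

```python
def decayed_sum(values, g, T):
    """Brute-force V_g(T): sum of f(t) * g(T - t) over all t < T.

    Assumption: values[t - 1] holds f(t) for t = 1, 2, ..., len(values).
    """
    last = min(T - 1, len(values))  # latest observation time that counts
    return sum(values[t - 1] * g(T - t) for t in range(1, last + 1))


# A 0/1 stream (a time-decaying count) under g(x) = 1/x, evaluated at T = 5.
stream = [1, 0, 1, 1]
count = decayed_sum(stream, lambda x: 1 / x, 5)  # 1/4 + 0 + 1/2 + 1 = 1.75
```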
Time-Decaying Average
• Time-decaying weighted average of the observed values.
• $f(t_i)$ is the value of the item observed at time $t_i$.
$$A_g(T) = \frac{\sum_{i \mid t_i < T} f(t_i)\, g(T - t_i)}{\sum_{i \mid t_i < T} g(T - t_i)}$$
Maintaining a time-decaying average reduces to maintaining two time-decaying sums.

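The reduction to two sums can be sketched as follows. This is a minimal brute-force illustration, assuming items arrive as hypothetical `(t_i, f_i)` pairs; the function names are not from the slides. The numerator is the decayed sum of the values, and the denominator is the decayed sum of an all-ones stream at the same arrival times.

```python
def decayed_sum_items(items, g, T):
    """Sum of value * g(T - t) over (t, value) pairs observed before time T."""
    return sum(value * g(T - t) for t, value in items if t < T)


def decayed_average(items, g, T):
    """Time-decaying average as the ratio of two time-decaying sums."""
    numerator = decayed_sum_items(items, g, T)
    # Same arrival times, every value replaced by 1: the decayed item count.
    denominator = decayed_sum_items([(t, 1) for t, _ in items], g, T)
    if denominator == 0:
        raise ValueError("no item with positive weight before time T")
    return numerator / denominator


# Two observations; under g(x) = 1/x the more recent one dominates.
average = decayed_average([(1, 10.0), (3, 20.0)], lambda x: 1 / x, 4)
```

In a streaming setting each of the two sums would be maintained by its own approximate summary rather than recomputed from stored items.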
Interesting Families of Decay Functions
• Sliding windows [DGIM02], $\mathrm{SliWin}_W$: g(x)=1 for x<W, g(x)=0 otherwise
• Polynomial decay, $\mathrm{PolyD}_\alpha$: $g(x) = 1/x^{\alpha}$
• Exponential decay [Jacobson 88], $\mathrm{ExpD}_\lambda$: $g(x) = \exp(-\lambda x)$
• General decay functions …

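The three named families can be written as small function factories. This is a sketch under assumptions: the factory names `sliwin`, `polyd`, and `expd` and the parameter names `alpha` and `lam` are chosen here for illustration, and the window boundary follows the slide's "x<W" convention.

```python
import math


def sliwin(W):
    """Sliding window: weight 1 for ages below W, 0 afterwards."""
    return lambda x: 1.0 if x < W else 0.0


def polyd(alpha):
    """Polynomial decay: g(x) = 1 / x**alpha, for ages x >= 1."""
    return lambda x: 1.0 / x ** alpha


def expd(lam):
    """Exponential decay: g(x) = exp(-lam * x)."""
    return lambda x: math.exp(-lam * x)


# Weights of an item of age 4 under each family.
weights = [sliwin(3)(4), polyd(2)(4), expd(0.5)(4)]
```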
Exponential Decay
• Used in networking applications (RED).
• Very simple maintenance:
$$V_{\mathrm{ExpD}_\lambda}(t) = f(t) + \exp(-\lambda)\, V_{\mathrm{ExpD}_\lambda}(t-1)$$
Lemma:
• Exact tracking requires $\Omega(N)$ storage bits.
• Approximate tracking uses $\Theta(\log N)$ bits.

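The one-line maintenance rule for exponential decay can be sketched as a single running value. The class name `ExpDecaySum` and the parameter `lam` are assumptions made for this illustration. With this recurrence the newest item carries weight 1, so after the update at time t the stored value equals the sum of f(s) * exp(-lam * (t - s)) over all s <= t.

```python
import math


class ExpDecaySum:
    """Running exponentially decayed sum kept in a single number."""

    def __init__(self, lam):
        self.decay = math.exp(-lam)  # per-step discount factor exp(-lam)
        self.value = 0.0

    def update(self, f_t):
        # V(t) = f(t) + exp(-lam) * V(t - 1)
        self.value = f_t + self.decay * self.value
        return self.value


# With lam = ln 2 every step halves the weight of everything seen so far.
tracker = ExpDecaySum(math.log(2))
history = [tracker.update(1) for _ in range(3)]
```

No per-item state is kept, which is what makes this family so cheap to maintain compared with the histogram-based schemes.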
Sliding Window Decay
Lemma [DGIM02]: Sliding window decay can be approximately tracked using $\Theta(\log^2 W)$ bits (for 0/1 or polynomial-size values).
• “Sharp threshold”
• Upper bound using the Exponential Histogram (EH) technique.

Polynomial Decay
Lemma (N is the elapsed time):
• Lower bound: $\Omega(\log N)$
• Upper bound: $O(\log N \log\log N)$
• Often more appropriate for applications than exponential or sliding window decay.
• More efficient than SliWin decay (nearly quadratic gap), and almost as efficient as exponential decay.

General Decay Functions
• Lemma: Can be (approximately) maintained using $O(\log^2 N)$ bits (N is the minimum of the elapsed time and the minimum x for which g(x)=0).
• Algorithm based on an adaptation of the Exponential Histograms technique.
• Sliding windows (with an $\Omega(\log^2 N)$ lower bound [DGIM02]) are as “hard” to maintain as general decay.

Why Polynomial Decay?
• Link performance over time
[Figure: performance (from bad to good) of Link A and Link B plotted against time, with time t0 marked.]
Which link should we select past time t0?
Initially A or B, eventually B.

Link Selection Example (cont.)
• Polynomial decay (by tuning the parameter): initially A or B, eventually B.
• Exponential decay: the relative value of A and B stays constant, so either A forever or B forever.
• Sliding window decay: first B, then A, then the same …
Polynomial decay can model our expectation (as can other smooth subexponential functions …)

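One way to see the contrast between the three families is to track the relative weight of two fixed observations, made `gap` time units apart, as both grow older. The sketch below is only an illustration of that idea and not the slides' link example itself; `weight_ratio` and the chosen parameters are assumptions.

```python
import math


def weight_ratio(g, age, gap):
    """Weight of the older observation relative to the newer one.

    The newer observation has the given age, the older one age + gap.
    Returns None once the newer observation has weight 0.
    """
    newer, older = g(age), g(age + gap)
    return older / newer if newer > 0 else None


exp_decay = lambda x: math.exp(-0.1 * x)       # exponential, lambda = 0.1
poly_decay = lambda x: 1.0 / x                 # polynomial, alpha = 1
window = lambda x: 1.0 if x < 10 else 0.0      # sliding window, W = 10

for age in (1, 6, 10, 1000):
    print(age, [weight_ratio(g, age, 5) for g in (exp_decay, poly_decay, window)])
```

Under exponential decay the ratio never changes, so whichever observation dominates does so forever. Under polynomial decay the ratio drifts toward 1, so old differences gradually stop mattering. Under a sliding window the ratio jumps from 1 to 0 when the older observation leaves the window.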
Summary of Bounds
Decay function   Upper bound              Lower bound
Exp decay        $O(\log N)$              $\Omega(\log N)$
Poly decay       $O(\log N \log\log N)$   $\Omega(\log N)$
SliWin decay     $O(\log^2 N)$            $\Omega(\log^2 N)$
General decay    $O(\log^2 N)$            $\Omega(\log^2 N)$
• N is the minimum of the elapsed time and the minimum x for which g(x)=0.
• Approximation is to within a factor of $1 \pm \epsilon$.

Bucketing the Stream
[Figure: the stream 1 0 0 0 111 00 1 laid out along the time axis and divided into buckets labeled “Time width: 3, Count: 1”, “Time width: 4, Count: 2”, and “Time width: 3, Count: 2”; a merge produces a bucket labeled “Time width: 7, Count: 4”.]
• A histogram is determined by its time boundaries and bucket counts.
• Time boundaries can be fixed (counts maintained per stream).
• Counts can be fixed (time boundaries maintained per stream).

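A bucket and the merge step can be sketched with a small record type. The `Bucket` class, its field names, and the inclusive start/end convention are assumptions made for illustration; the slides only say that a histogram is determined by time boundaries and counts.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    start: int  # first time step covered (inclusive)
    end: int    # last time step covered (inclusive)
    count: int  # sum of the values that fell into [start, end]

    @property
    def width(self):
        return self.end - self.start + 1


def merge(older, newer):
    """Merge two adjacent buckets: time widths add up and counts add up."""
    if older.end + 1 != newer.start:
        raise ValueError("buckets must be adjacent in time")
    return Bucket(older.start, newer.end, older.count + newer.count)


# Widths 4 and 3 with counts 2 and 2 merge into width 7 with count 4.
merged = merge(Bucket(4, 7, 2), Bucket(8, 10, 2))
```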
Exponential Histograms [DGIM02]
• Introduced for sliding windows.
• Each new item is placed in a new bucket.
• Two buckets are merged when their combined count is at most a fixed fraction of the combined count of all more recent buckets.
• Buckets with start time greater than W are discarded.
• Bucket counts are independent of the stream.
• The sum of the bucket counts is a constant-factor approximation for $\mathrm{SliWin}_W$.

Exponential Histograms (cont)
• Example for a factor-2 approximation (bucket counts):
  • 1
  • 1, 1
  • 1, 1, 1
  • 1, 1, 2 (merge)
  • 1, 1, 1, 2
  • 1, 1, 2, 2 (merge)
• Values whose time is “in question” (before or after W) are aggregated in the least recent bucket.

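A minimal Exponential Histogram for 0/1 streams can be sketched as below. This follows the power-of-two formulation from [DGIM02] (at most `max_per_size` buckets of each size, each bucket remembering the arrival time of its newest item) rather than the fraction-based merge rule as worded on the slide, so treat it as a swapped-in variant. The class and method names, the default of 3 buckets per size, and the convention that an item added at the current time has age 0 are all assumptions. With the default setting, feeding six 1s reproduces the bucket counts listed in the example.

```python
class ExponentialHistogram:
    """DGIM-style histogram estimating the number of 1s in a sliding window."""

    def __init__(self, window, max_per_size=3):
        self.window = window
        self.max_per_size = max_per_size
        self.now = 0
        self.buckets = []  # newest first; each entry is [time of newest item, count]

    def add(self, t, bit):
        self.now = t
        # Discard buckets that lie entirely outside the window.
        while self.buckets and t - self.buckets[-1][0] >= self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, [t, 1])
        start = 0
        while start < len(self.buckets):
            size = self.buckets[start][1]
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_per_size:
                break
            # Too many buckets of this size: merge the two oldest of them.
            self.buckets[end - 2][1] = 2 * size  # keeps the newer timestamp
            del self.buckets[end - 1]
            start = end - 2  # the merged bucket may overflow the next size

    def counts(self):
        return [count for _, count in self.buckets]

    def estimate(self, window=None):
        """Estimate the count over the last `window` steps (default: full window)."""
        w = self.window if window is None else window
        inside = [b for b in self.buckets if self.now - b[0] < w]
        if not inside:
            return 0.0
        total = sum(count for _, count in inside)
        oldest = inside[-1][1]
        # Between 1 and `oldest` items of the oldest bucket are inside the
        # window; use the midpoint of that range.
        return total - (oldest - 1) / 2


eh = ExponentialHistogram(window=100)
trace = []
for t in range(1, 7):
    eh.add(t, 1)
    trace.append(eh.counts())
```

Only the oldest bucket that still overlaps the window is uncertain, so the estimation error is at most about half of that bucket's count. The optional `window` argument of `estimate` queries a shorter window from the same structure.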
EH Properties
• The number of buckets is O(log W). For each bucket we need to record the exact start time, so we need O(log W) storage per bucket (total is O(log^2 W)).
• An EH for sliding window W can be used to approximate sliding window j for all j<W.
Lemma: An EH can be used to approximate general decay functions (with W = the minimum of the elapsed time and the minimum x for which g(x)=0).

Reducing any Decay Function to Sliding Windows
• Decay function g(x)
$$V_g(T) = \sum_{T-N \le t < T} f(t)\, g(T-t)$$
$$= g(N) \sum_{T-N \le t < T} f(t) + \sum_{i=1}^{N-1} \bigl(g(N-i) - g(N-i+1)\bigr) \sum_{T-(N-i) \le t < T} f(t)$$
$$= g(N)\, \mathrm{SliWin}_N(T) + \sum_{i=1}^{N-1} \bigl(g(N-i) - g(N-i+1)\bigr)\, \mathrm{SliWin}_{N-i}(T)$$
From (approximate) $\mathrm{SliWin}_W$ for all W<=N we can compute the (approximate) decayed sum according to g().
With an EH with W=N we can compute (approximately) the decayed sums according to all decay functions g() up to elapsed time N (or forever if g(N)=0).

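The reduction is a summation by parts, and it can be checked numerically with exact window sums. The sketch below assumes `values[t - 1]` holds f(t), takes T to be one step past the last item, and takes the window of width W to cover ages 1 through W; these conventions and the function names are assumptions for illustration. Exact rational arithmetic is used so the two sides can be compared for equality.

```python
from fractions import Fraction


def decayed_sum_direct(values, g):
    """V_g(T) computed from the definition, with T = len(values) + 1."""
    T = len(values) + 1
    return sum(values[t - 1] * g(T - t) for t in range(1, T))


def decayed_sum_from_windows(values, g):
    """V_g(T) computed only from sliding-window sums of widths 1..N."""
    N = len(values)
    if N == 0:
        return 0
    # window[W] = sum of the W most recent values (ages 1..W).
    window = [0] * (N + 1)
    for W in range(1, N + 1):
        window[W] = window[W - 1] + values[N - W]
    total = g(N) * window[N]
    for i in range(1, N):
        # The coefficient is non-negative because g is non-increasing.
        total += (g(N - i) - g(N - i + 1)) * window[N - i]
    return total


values = [3, 0, 2, 5, 1]
inverse_square = lambda x: Fraction(1, x * x)
direct = decayed_sum_direct(values, inverse_square)
via_windows = decayed_sum_from_windows(values, inverse_square)
```

Because every coefficient is non-negative, replacing each exact window sum by an estimate with relative error at most ε yields a decayed sum with relative error at most ε as well, which is why an approximate sliding-window structure suffices.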
Weight-Based Merging
• Bucket start times depend only on elapsed time.
• WBM Histograms apply to decay functions where g(x)/g(x+1) is non-increasing.
• The number of buckets is O(log(g(1)/g(N))).
• O(log log N) storage per bucket (for approximate bucket counts).
• More efficient than EH on decay that is slightly super-polynomial or slower.
• O(log N log log N) storage for polynomial decay.

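For polynomial decay, the bucket count and the per-bucket storage combine into the stated bound. A short worked calculation, assuming regions whose weights span a factor of $1+\epsilon$ each and $g(x) = 1/x^{\alpha}$ (the symbols $\alpha$ and $\epsilon$ are assumed names for the decay and accuracy parameters):

```latex
\[
\log_{1+\epsilon}\frac{g(1)}{g(N)}
  = \log_{1+\epsilon} N^{\alpha}
  = \frac{\alpha \ln N}{\ln(1+\epsilon)}
  = O(\log N) \quad \text{for fixed } \alpha \text{ and } \epsilon ,
\]
\[
O(\log N) \text{ buckets} \times O(\log\log N) \text{ bits per bucket}
  = O(\log N \log\log N) \text{ bits}.
\]
```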
WBM Histograms – How?
• Region boundaries b1, b2, b3, …:
$$b_1 = \arg\max_x \{\, (1+\epsilon)\, g(x-1) \ge g(1) \,\}, \qquad b_{i+1} = \arg\max_x \{\, (1+\epsilon)\, g(x-1) \ge g(b_i) \,\}$$
• The current most-recent bucket is sealed and a new bucket is started at every time T such that T mod b1 = 0.
• Two consecutive buckets that are in the same region (according to their elapsed start and end times) are merged.
• At most 2 buckets per region.

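The region boundaries can be computed directly from the decay function. The sketch below assumes one reading of the boundary rule, namely that b_1 is the largest x with (1+ε)·g(x-1) >= g(1) and b_{i+1} is the largest x with (1+ε)·g(x-1) >= g(b_i); this reading is an interpretation, chosen because it yields regions whose weights span a factor of 1+ε. The function name, the `ratio` parameter (standing for 1+ε), and the `horizon` cutoff are assumptions. Exact fractions avoid rounding problems at the boundaries.

```python
from fractions import Fraction


def region_boundaries(g, ratio, horizon):
    """Boundaries b_1 < b_2 < ... that lie below `horizon` (ratio = 1 + epsilon)."""
    boundaries = []
    prev = 1
    while True:
        target = g(prev)
        b = prev + 1  # x = prev + 1 always qualifies because ratio >= 1
        # Advance while x = b + 1 still satisfies ratio * g(x - 1) >= target.
        while b < horizon and ratio * g(b) >= target:
            b += 1
        if b >= horizon:
            return boundaries  # the next boundary is at or beyond the horizon
        boundaries.append(b)
        prev = b


harmonic = lambda x: Fraction(1, x)
bounds = region_boundaries(harmonic, 2, 20)  # [3, 7, 15]

# Regions as inclusive age ranges: [1, b_1 - 1], [b_1, b_2 - 1], ...
starts = [1] + bounds[:-1]
regions = [(lo, hi - 1) for lo, hi in zip(starts, bounds)]
```

For g(x)=1/x and 1+ε=2 the regions come out as ages 1 to 2, 3 to 6, and 7 to 14, that is, weights {1, 1/2}, {1/3, …, 1/6}, and {1/7, …, 1/14}.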
WBMH Example
• g(x)=1/x, (1+ε)=2
• Regions (values of g): {1, 1/2}, {1/3, 1/4, 1/5, 1/6}, {1/7, 1/8, …, 1/14}
[Figure: the histogram buckets at times T=1 through T=6.]

Conclusion
Summary:
• Efficient computation of time-decayed sums and averages for general decay functions.
• Very efficient computation for polynomial decay.
• Open question: O(log N) storage for polynomial decay.
• Subsequent related work: Spatial decay (sensor nets/p2p nets)