Transcript
Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Better Burst

Better Burst Detection

Xin Zhang Dennis ShashaDepartment of Computer Science

Courant Institute of Mathematical SciencesNew York University

{xinzhang,shasha}@cs.nyu.edu

Abstract

A burst is a large number of events occurring within acertain time window. Many data stream applications re-quire the detection of bursts across a variety of windowsizes. For example, stock traders may be interested in burstshaving to do with institutional purchases or sales that arespread out over minutes or hours.

In this paper, we present a new algorithmic frameworkfor elastic burst detection [1]: a family of data structuresthat generalizes the Shifted Binary Tree, and a heuristicsearch algorithm to find an efficient structure given the in-put. We study how different inputs affect the desired struc-tures and the probability to trigger a detailed search. Exper-iments on both synthetic and real world data show a factorof up to 35 times improvement compared with the Shifted Bi-nary Tree over a wide variety of inputs, depending on the in-puts.

1. Introduction

A burst is a large number of events occurring withina certain time window. It’s a noteworthy phenomenonin many natural and social processes. For example,stock traders are interested in bursts of trading vol-ume, which may reflect hidden news. Astrophysicists areinterested in gamma ray bursts, which may reflect the oc-currence of a supernova. Furthurmore, many data applica-tions require detection of bursts across a variety of win-dow sizes. For example, interesting gamma ray burstscould last several seconds, several minutes or even sev-eral days.

The elastic burst detection problem [1] is to detect burstsacross multiple window sizes. When the aggregate in a win-dow exceeds the threshold for that window size, a burst oc-curs. A naive algorithm is to check each window size of in-terest one at a time. To detect bursts over k window sizesin a sequence of length N naively requires O(kN) time.

In [1], the authors show that a simple data structure calledthe Shifted Binary Tree can beat the naive algorithm by sev-eral orders of magnitude when the probability of bursts isvery low.

A Shifted Binary Tree includes a binary tree as the basestructure. It also includes shifted sublevels to each base sub-level above level 0. The shifted sublevel i is still of length 2i,but the correspondending window is shifted by 2i−1 fromthe base sublevel. Figure 1.a shows an example.

The overlap between the base sublevels and the shiftedsublevels guarantees that all the windows of length w, w ≤1 + 2i, are included in one of the windows at level i + 1.Let f(w) be the threshold for size w. Because the aggrega-tion function is monotonically increasing, whenever morethan f(2+2i−1) events are found in a window of size 2i+1,then a detailed search must be performed to check if somesubwindow of size w, 2 + 2i−1 ≤ w ≤ 1 + 2i, has f(w)events. All bursts are guaranteed to be reported and manynon-burst windows are filtered away without requiring a de-tailed check. Unfortunately, when bursts are rare but notvery rare, the number of fruitless detailed searches grows,suggesting we may want more levels than Shifted BinaryTree provides; conversely, when bursts are exceedingly rarewe may need fewer levels. In other words, we want a struc-ture that adapts to the input.

This paper presents a family of multiresolution over-lapping data structures, called Shifted Aggregation Trees,which generalizes the Shifted Binary Tree and includesmany other structures. We then present a heuristic search al-gorithm to find an efficient data structure given the sampleinput series and window thresholds. We theoretically ana-lyze and empirically study how different inputs affect thedesired structures and the probability to trigger a detailedsearch. Experiments on both synthetic data and real worlddata show that the Shifted Aggregation Tree outperformsthe Shifted Binary Tree over a variety of inputs, yieldingup to a factor of 35 times speedup in some cases. Due tospace limitations, details and references missing in this pa-per can be found in [2].

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 2: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Better Burst

Level 0

Level 1

Level 2

Level 3

Level 4

(a) Shifted Binary Tree

Level 0Level 1

Level 2

Level 3

Level 4

(b) Embed a Shifted Binary Tree (SBT) in an Ag-gregation Pyramid (AP). Each shaded cell in theAP corresponds to a node in the SBT. The differ-ent shadings in level 2 show the one-to-one cor-respondence.

(c) An example of Shifted AggregationTree embedded in the Aggregation Pyra-mid

Figure 1. Shifted Binary Tree, AggregationPyramid and Shifted Aggregation Tree.

2. Shifted Aggregation Tree

2.1. Embed a Shifted Binary Tree in an Aggrega-tion Pyramid

Aggregation Pyramid is an N -level isosceles triangular-shaped data structure built over a time window of size Nshown in Figure 1.b and 1.c: level 0 has N cells and isin one-to-one correspondence with the original time series;level 1 has N − 1 cells, the first cell stores the aggregateof the first two data items (say, data items 1 and 2) in theoriginal time series, the second cell stores the aggregate ofthe second two data items (data items 2 and 3), etc; and soon. In all, it stores the original time series and all the ag-

SBT SATNumber of children 2 ≥ 2

Levels of children for i ≤ ilevel i + 1

Shift at level i + 1: Si+1 2 ∗ Si k ∗ Si, k ≥ 1Overlapping window window size ≥ wi

size at level i + 1: Oi+1 at level i: wi

Table 1. Comparing the Shifted AggregationTree (SAT) with the Shifted Binary Tree (SBT)

gregates for every window size starting at every time pointwithin this time window.

Recall that in a Shifted Binary Tree (SBT), level 0 storesthe original time series, and level i stores the aggregates ofwindow size 2i. So, each node in a SBT has a correspond-ing cell in the aggregation pyramid. Thus the SBT can beembedded in the aggregation pyramid as shown in Figure1.b.

Obviously, there are many other possible embeddingsinto the aggregation pyramid. By using different structureson different data inputs, we can achieve optimal perfor-mance by trading off structure maintenance against filter-ing selectivity.

2.2. Shifted Aggregation Tree Generalizes ShiftedBinary Tree

Like a Shifted Binary Tree, a Shifted Aggregation Treeis a hierarchical tree structure defined on a subset of thecells of an aggregation pyramid. It has several levels, eachof which contains several nodes. The nodes at level 0 arein one-to-one correspondence with the original time series.Any node at level i is computed by aggregating some nodesbelow level i. Two consecutive nodes at the same level over-lap in time.

A Shifted Aggregation Tree (SAT) is different from aShifted Binary Tree (SBT) in two ways:

• The parent-child structure defining the topological re-lationship between a node and its children, i.e. howmany children it has and their placements.

• The pattern defining how many time points apart are(called the shift) two neighboring nodes at the samelevel.

Table 2.2 gives a side-by-side comparison of the differ-ence between a SAT and a SBT. Clearly, a SBT is a specialcase of a SAT. Figure 1.c shows one example of Shifted Ag-gregation Trees.

The Shifted Aggregation Tree shares an important prop-erty with the Shifted Binary Tree: any window of size w,w ≤ hi − si + 1, is included by a node at level i, where

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 3: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Better Burst

hi is the corresponding window size of level i, and si is theshift of level i. This shared property yields a similar detec-tion algorithm to that of a Shifted Binary Tree. When newdata points come, the nodes in a Shifted Aggregation Treeare updated from bottom up. After a node at level i+1 is up-dated, if the new aggregate is more than f(hi − si + 2), adetailed search is performed on the subwindows of size w,hi−si +2 ≤ w ≤ hi+1−si+1+1. An efficient Shifted Ag-gregation Tree should balance between the update time andthe detailed search time.

3. Heuristic state-space algorithm to searchan efficient Shifted Aggregation Tree

To find an efficient Shifted Aggregation Tree (SAT), onecan use a state-space search algorithm, given the sample in-put series and window thresholds. Each SAT is seen as astate. By adding a level onto the top of SAT B, if we canget another SAT A, we say state B can be transformed tostate A. The algorithm starts from a SAT having level 0only, then keeps transforming the candidate set of SATs, un-til a set of final SATs (i.e. those which can detect bursts inall windows of interst) are reached. In order to find the bestSAT, the best-first strategy is used to explore the state space.Each state is associated with a cost. The state with the min-imum cost is picked as the next state to be explored, andthe final SAT with the minimum cost is picked as the de-sired structure.

The cost associated with each state is used to indicatewhich structure to choose in term of running time. We usea theoretical cost model – the expected number of opera-tions – to model the CPU running time. Our model is a sim-ple RAM model: all operations (updates and comparisons)take constant time. The total cost is the sum of the numberof updating operations and the expected number of compar-ison operations, given the sample input series, the windowthresholds and the tree structure. Our experiment shows thatthe theoretical cost model models the actual CPU runningtime well [2].

4. Empirical Results

4.1. Synthetic Data

Two classes of probabilistic distributions widely used tomodel many real world applications were chosen to gener-ate the synthetic data: the Poisson distribution and the expo-nential distribution. For each distribution, we synthesized aset of data with different parameters in a broad range. Wewant to see how different distributions and thresholds af-fect the desired Shifted Aggregation Trees.

Assume that each point in the input time series has amean μ and a standard deviation σ. Assume that for each

window size, the probability to exceed the threshold besome value p. We can infer [2] that the alarm probability Pa

for an aggregate of window size W to exceed the thresoldfor window size w is

Φ((√

T − 1√T

)√

σ+

Φ−1(p)√T

)

where T = W/w, denotes the bounding ratio, and Φ(x) isthe normal cumulative distribution function.

So Pa is determined by the distribution parameters μ andσ, the threshold parameter p, the bounding ratio T and thelevel w in the underlying aggregation pyramid. The experi-ments on both Poisson data and exponential data also showthe following:

• The larger the ratio μ

σis, the larger the alarm probabil-

ity Pa. To mitigate this, the Shifted Aggregation Treebecomes denser in order to bring down Pa. When μ

σ

becomes very large, Pa is close to 1. So the structureturns sparser again to reduce the updating time, but isessentially useless.

• The smaller the burst probability p, the larger thethreshold, the smaller Pa, so the structure becomessparser since there are less bursts to worry about.

• As the bounding ratio T decreases, so does Pa. In aShifted Aggregation Tree, T could be very close to 1,e.g. W = w + 1, whereas T in a Shifted Binary Treeis designed to be about 4. Thus a SAT is able to ad-just its structure to reduce the alarm probability.

• As the size w increases, so does Pa, while the bound-ing ratio T becomes smaller in the Shifted Aggrega-tion Tree to try to reduce Pa.

In summary, because the Shifted Aggregation Tree canadjust its structure to reduce the alarm probability, it canachieve far better running time than the Shifted Binary Tree.

4.2. Real World Data

We have used two real world data sets to test the pro-posed framework: the Sloan Digital Sky Survey (SDSS)SkyServer traffic data which records all the access traffic tothe SDSS SkyServer in 2003, and the NYSE TAQ stock datawhich includes tick-by-tick trading activities of the IBMstock between Jan. 1st, 2001 to May 31st, 2004. All theevents within the same second are aggregated as a singledatum. The statistics show that the SDSS data follows thePoisson distribution while the IBM data is close to the ex-ponential distribution.

We are interested in comparing the Shifted AggregationTree (SAT) with the Shifted Binary Tree (SBT) under dif-ferent settings.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 4: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Better Burst

CPU Time vs. Threshold

0

20000

40000

60000

80000

100000

120000

2 3 4 5 6 7 8 9

Burst Probability p=10-k

CP

U T

ime (m

s)

SDSS_SAT

SDSS_SBT

IBM_SAT

IBM_SBT

(a) Threshold

CPU Time vs. Max Window Size of Interest

0

100000

200000

300000

400000

500000

10 30 60 120 300 600 1800

Max Window Size of Interest

CP

U T

ime (m

s) SDSS_SAT

SDSS_SBT

IBM_SAT

IBM_SBT

(b) Maximum window size of interest

CPU Time vs. Set of Window Sizes

0

200000

400000

600000

800000

1000000

1 5 10 30 60 120

Window Size Step

CP

U T

ime (m

s)

SDSS_SAT

SDSS_SBT

IBM_SAT

IBM_SBT

(c) Different set of window sizes of interest

Figure 2. Performance test: CPU time comparison on the Sloan Digital Sky Survey (SDSS) SkyServertraffic data and the IBM stock data

• Different thresholds are set to reflect a burst probabil-ity ranging from 10−2 to 10−9.

• Different maximum window sizes of interest are setfrom 10 seconds up to 1800 seconds.

• Different sets of window sizes of interest are set to bea set of sizes n, 2 ∗ n, 3 ∗ n, ...., where n is set to be 1,5, 10, 30, 60, 120 respectively.

Figure 2 shows the results for both data sets. The ShiftedAggregation Tree performs better than the Shifted BinaryTree by a factor of from 2 to 5 times.

We also studied how sensitive the desired Shifted Ag-gregation Tree is to the training data and how the searchparameters in the state-space algorithm affect the desiredstructure. The experiments show that the training sets withsimilar statistics produce similar structure, while the train-ing set with largely different characteristics from the test-ing set could cause the desired structure to perform poor.It’s also shown that small search parameters work well inthe best-first search algorithm.

5. Related Work

As an interesting and important phenomenon, monitor-ing and mining bursty behaviors is attracting increasing in-terest recently. Due to the space limitations, the referencebibliographies can be found in [2].

Wang et al. model the bursty behavior in self-similar timeseries, such as Ethernet, disk traffic, etc. Kleinberg stud-ies the bursty and hierarchical structure in temporal textstreams to discover how high frequency words change overtime. His work focus on modeling the bursty behaviors,while our focus is a high-performance algorithm to detectbursts, thus complementing his work. Vlachos et al. minethe bursty behavior in the query logs of the MSN search en-

gine. We share the view that burst detection should be a pre-liminary primitive for further knowledge mining process.

Neill et al. study the problem of detecting signifi-cant spatial clusters in multidimensional space. They con-sider a general non-monotonic density function and use theoverlap-kd tree with a fixed overlapping structure indepen-dent of the input data. Our technique can adapt to differentcharacteristics of the input data and be applied to their ap-plication to achieve better performance.

Datar et al. and Gibbons et al. study a related problem:estimating the number of 1’s in a 0-1 stream and the sumof bounded integers in an integer stream in the last N ele-ments. They use synopsis structures called Exponential His-tograms and Waves respectively. Like our Shifted Aggrega-tion Tree, these are multiresolution aggregation structures,though with coarser aggregation levels for the past and fineraggregation levels for recent data.

6. Conclusion

In this paper, we propose a framework for adaptive andtherefore better elastic burst detection. It includes a familyof data structures and a heuristic search algorithm to find anefficient structure given the input. Experiments show an im-provement factor of up to 35 times depending on the input.

References

[1] D. Shasha and Y. Zhu. High Performance Discovery inTime Series: Techniques and Case Studies, pages 151–172.Springer, 2004.

[2] X. Zhang and D. Shasha. High performance burst detection.Technical Report, New York University, 2005.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE


Top Related