Download [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Better Burst Detection

  • Better Burst Detection

    Xin Zhang, Dennis Shasha

    Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

    {xinzhang,shasha}@cs.nyu.edu

    Abstract

    A burst is a large number of events occurring within a certain time window. Many data stream applications require the detection of bursts across a variety of window sizes. For example, stock traders may be interested in bursts having to do with institutional purchases or sales that are spread out over minutes or hours.

    In this paper, we present a new algorithmic framework for elastic burst detection [1]: a family of data structures that generalizes the Shifted Binary Tree, and a heuristic search algorithm to find an efficient structure given the input. We study how different inputs affect the desired structures and the probability of triggering a detailed search. Experiments on both synthetic and real world data show speedups of up to a factor of 35 compared with the Shifted Binary Tree across a wide variety of inputs.

    1. Introduction

    A burst is a large number of events occurring within a certain time window. It is a noteworthy phenomenon in many natural and social processes. For example, stock traders are interested in bursts of trading volume, which may reflect hidden news. Astrophysicists are interested in gamma ray bursts, which may reflect the occurrence of a supernova. Furthermore, many data applications require detection of bursts across a variety of window sizes. For example, interesting gamma ray bursts could last several seconds, several minutes or even several days.

    The elastic burst detection problem [1] is to detect bursts across multiple window sizes. When the aggregate in a window exceeds the threshold for that window size, a burst occurs. A naive algorithm is to check each window size of interest one at a time. To detect bursts over k window sizes in a sequence of length N naively requires O(kN) time.
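    The naive approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the aggregate is taken to be a sum, a burst is reported as a (start index, window size) pair, and the name `naive_bursts` and its parameters are hypothetical, not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def naive_bursts(series: Sequence[float],
                 window_sizes: Sequence[int],
                 threshold: Callable[[int], float]) -> List[Tuple[int, int]]:
    """Return (start, w) for every window of size w whose sum exceeds threshold(w).

    Assumption: the aggregate is a plain sum and "burst" means strictly
    greater than the threshold for that window size.
    """
    bursts = []
    n = len(series)
    for w in window_sizes:                     # k window sizes of interest
        if w > n:
            continue
        running = sum(series[:w])              # aggregate of the first window
        if running > threshold(w):
            bursts.append((0, w))
        for start in range(1, n - w + 1):      # slide the window: O(N) per size
            running += series[start + w - 1] - series[start - 1]
            if running > threshold(w):
                bursts.append((start, w))
    return bursts
```

    Each window size costs one pass over the series, which gives the O(kN) total mentioned above.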

    In [1], the authors show that a simple data structure called the Shifted Binary Tree can beat the naive algorithm by several orders of magnitude when the probability of bursts is very low.

    A Shifted Binary Tree includes a binary tree as the base structure. It also includes a shifted sublevel for each base sublevel above level 0. The shifted sublevel $i$ still has windows of length $2^i$, but the corresponding window is shifted by $2^{i-1}$ from the base sublevel. Figure 1.a shows an example.

    The overlap between the base sublevels and the shifted sublevels guarantees that all the windows of length $w$, $w \le 1 + 2^i$, are included in one of the windows at level $i + 1$. Let $f(w)$ be the threshold for size $w$. Because the aggregation function is monotonically increasing, whenever more than $f(2 + 2^{i-1})$ events are found in a window of size $2^{i+1}$, then a detailed search must be performed to check if some subwindow of size $w$, $2 + 2^{i-1} \le w \le 1 + 2^i$, has $f(w)$ events. All bursts are guaranteed to be reported and many non-burst windows are filtered away without requiring a detailed check. Unfortunately, when bursts are rare but not very rare, the number of fruitless detailed searches grows, suggesting we may want more levels than the Shifted Binary Tree provides; conversely, when bursts are exceedingly rare we may need fewer levels. In other words, we want a structure that adapts to the input.
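    The filter-then-search idea can be sketched as below. This is an offline illustration, not the authors' streaming implementation. It assumes non-negative data, a sum aggregate, a non-decreasing threshold function, nodes at level j that start at multiples of $2^{j-1}$ (base and shifted sublevels combined), and nodes clipped at the end of the series. The name `sbt_bursts` is hypothetical.

```python
from itertools import accumulate


def sbt_bursts(series, max_window, threshold):
    """Report (start, w) bursts for all w in 1..max_window using SBT-style filtering."""
    n = len(series)
    prefix = [0] + list(accumulate(series))    # prefix[t] == sum(series[:t])

    def window_sum(t, w):
        return prefix[t + w] - prefix[t]

    bursts = set()
    for t in range(n):                         # level 0: windows of size 1
        if series[t] > threshold(1):
            bursts.add((t, 1))

    covered = 1                                # largest window size handled so far
    level = 1
    while covered < max_window:
        size, shift = 2 ** level, 2 ** (level - 1)
        # This level covers every window of size w <= shift + 1.
        lo, hi = covered + 1, min(shift + 1, max_window)
        for a in range(0, n, shift):           # node covers [a, a + size)
            node = prefix[min(a + size, n)] - prefix[a]
            if node > threshold(lo):           # filter: only then search in detail
                for w in range(lo, hi + 1):
                    # windows starting in [a, a + shift) lie inside this node
                    for t in range(a, min(a + shift, n - w + 1)):
                        if window_sum(t, w) > threshold(w):
                            bursts.add((t, w))
        covered = hi
        level += 1
    return sorted(bursts)
```

    When no node passes the filter, the work per level is one comparison per node, which is where the savings over the naive scan come from under these assumptions.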

    This paper presents a family of multiresolution overlapping data structures, called Shifted Aggregation Trees, which generalizes the Shifted Binary Tree and includes many other structures. We then present a heuristic search algorithm to find an efficient data structure given the sample input series and window thresholds. We theoretically analyze and empirically study how different inputs affect the desired structures and the probability of triggering a detailed search. Experiments on both synthetic data and real world data show that the Shifted Aggregation Tree outperforms the Shifted Binary Tree over a variety of inputs, yielding up to a factor of 35 speedup in some cases. Due to space limitations, details and references missing in this paper can be found in [2].

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

  • [Figure 1, three panels, each drawn over levels 0 to 4: (a) Shifted Binary Tree. (b) Embed a Shifted Binary Tree (SBT) in an Aggregation Pyramid (AP). Each shaded cell in the AP corresponds to a node in the SBT. The different shadings in level 2 show the one-to-one correspondence. (c) An example of a Shifted Aggregation Tree embedded in the Aggregation Pyramid.]

    Figure 1. Shifted Binary Tree, Aggregation Pyramid and Shifted Aggregation Tree.

    2. Shifted Aggregation Tree

    2.1. Embed a Shifted Binary Tree in an Aggregation Pyramid

    An Aggregation Pyramid is an $N$-level isosceles triangular-shaped data structure built over a time window of size $N$, shown in Figure 1.b and 1.c: level 0 has $N$ cells and is in one-to-one correspondence with the original time series; level 1 has $N - 1$ cells, the first cell stores the aggregate of the first two data items (say, data items 1 and 2) in the original time series, the second cell stores the aggregate of the second two data items (data items 2 and 3), etc.; and so on. In all, it stores the original time series and all the aggregates for every window size starting at every time point within this time window.

    Property                                          | SBT                             | SAT
    Number of children                                | 2                               | $\ge 2$
    Levels of children for level $i+1$                | $i$                             | $\le i$
    Shift at level $i+1$: $S_{i+1}$                   | $2 \cdot S_i$                   | $k \cdot S_i$, $k \ge 1$
    Overlapping window size at level $i+1$: $O_{i+1}$ | window size at level $i$: $w_i$ | $\ge$ window size at level $i$: $w_i$

    Table 1. Comparing the Shifted Aggregation Tree (SAT) with the Shifted Binary Tree (SBT)

    Recall that in a Shifted Binary Tree (SBT), level 0 stores the original time series, and level $i$ stores the aggregates of window size $2^i$. So, each node in an SBT has a corresponding cell in the aggregation pyramid. Thus the SBT can be embedded in the aggregation pyramid as shown in Figure 1.b.
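    To make the pyramid concrete, here is a minimal Python sketch that materializes every cell. It assumes the aggregate is a sum, and the name `aggregation_pyramid` and the indexing convention `pyramid[level][t]` are choices made for this illustration only.

```python
def aggregation_pyramid(series):
    """Build the full pyramid: pyramid[level][t] is the sum of the
    level + 1 consecutive items starting at time t (0-based)."""
    n = len(series)
    pyramid = [list(series)]                   # level 0 mirrors the time series
    for level in range(1, n):
        below = pyramid[-1]
        # Extend each window on the level below by one more data item.
        pyramid.append([below[t] + series[t + level] for t in range(n - level)])
    return pyramid
```

    Under this indexing, an SBT node of window size $2^i$ starting at time `a` corresponds to the cell `pyramid[2**i - 1][a]`. The full pyramid has $N(N+1)/2$ cells, which is why a practical structure keeps only a subset of them.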

    Obviously, there are many other possible embeddings into the aggregation pyramid. By using different structures on different data inputs, we can achieve optimal performance by trading off structure maintenance against filtering selectivity.

    2.2. Shifted Aggregation Tree Generalizes Shifted Binary Tree

    Like a Shifted Binary Tree, a Shifted Aggregation Tree is a hierarchical tree structure defined on a subset of the cells of an aggregation pyramid. It has several levels, each of which contains several nodes. The nodes at level 0 are in one-to-one correspondence with the original time series. Any node at level $i$ is computed by aggregating some nodes below level $i$. Two consecutive nodes at the same level overlap in time.

    A Shifted Aggregation Tree (SAT) differs from a Shifted Binary Tree (SBT) in two ways:

    The parent-child structure, which defines the topological relationship between a node and its children, i.e. how many children it has and their placements.

    The shift pattern, which defines how many time points apart two neighboring nodes at the same level are (this distance is called the shift).

    Table 1 gives a side-by-side comparison of the differences between a SAT and an SBT. Clearly, an SBT is a special case of a SAT. Figure 1.c shows one example of a Shifted Aggregation Tree.

    The Shifted Aggregation Tree shares an important property with the Shifted Binary Tree: any window of size $w$, $w \le h_i - s_i + 1$, is included by a node at level $i$, where

  • $h_i$ is the corresponding window size of level $i$, and $s_i$ is the shift of level $i$. This shared property yields a detection algorithm similar to that of a Shifted Binary Tree. When new data points arrive, the nodes in a Shifted Aggregation Tree are updated from the bottom up. After a node at level $i+1$ is updated, if the new aggregate is more than $f(h_i - s_i + 2)$, a detailed search is performed on the subwindows of size $w$, $h_i - s_i + 2 \le w \le h_{i+1} - s_{i+1} + 1$. An efficient Shifted Aggregation Tree should balance the update time against the detailed search time.
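    The window sizes that each level is responsible for follow directly from the two bounds above. The sketch below computes them for an arbitrary list of (window size, shift) pairs; the function name `responsible_ranges` and the convention that level 0 is the pair (1, 1) are assumptions made for this illustration.

```python
def responsible_ranges(levels):
    """levels[i] = (h_i, s_i), with levels[0] == (1, 1) for the raw series.

    Returns {i: (lo, hi)} for i >= 1, where a node at level i triggers a
    detailed search over window sizes lo..hi:
        lo = h_{i-1} - s_{i-1} + 2   (first size not covered by level i-1)
        hi = h_i - s_i + 1           (largest size covered by level i)
    """
    ranges = {}
    for i in range(1, len(levels)):
        h_prev, s_prev = levels[i - 1]
        h, s = levels[i]
        ranges[i] = (h_prev - s_prev + 2, h - s + 1)
    return ranges


# A Shifted Binary Tree is the special case h_i = 2**i, s_i = 2**(i - 1).
sbt_levels = [(1, 1), (2, 1), (4, 2), (8, 4), (16, 8)]
sbt_ranges = responsible_ranges(sbt_levels)
```

    For the SBT levels listed, level 3 comes out as sizes 4 to 5, which agrees with the bounds $2 + 2^{i-1} \le w \le 1 + 2^i$ for $i = 2$ given in the introduction.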

    3. Heuristic state-space algorithm to search for an efficient Shifted Aggregation Tree

    To find an efficient Shifted Aggregation Tree (SAT), one can use a state-space search algorithm, given the sample input series and window thresholds. Each SAT is seen as a state. If adding a level onto the top of SAT B yields another SAT A, we say state B can be transformed to state A. The algorithm starts from a SAT having level 0 only, then keeps transforming the candidate set of SATs until a set of final SATs (i.e. those which can detect bursts in all windows of interest) is reached. In order to find the best SAT, the best-first strategy is used to explore the state space. Each state is associated with a cost. The state with the minimum cost is picked as the next state to be explored, and the final SAT with the minimum cost is picked as the desired structure.
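    A generic best-first skeleton of this kind might look as follows in Python. It is a sketch, not the authors' algorithm: the callables `expand`, `is_final` and `cost` are hypothetical hooks that a caller would supply (for example, states could be tuples of (window size, shift) pairs, and `expand` could append one candidate top level). The sketch returns the first final state it pops, which is the cheapest final state only under the additional assumption that adding a level never lowers the cost.

```python
import heapq
from itertools import count


def best_first_search(start, expand, is_final, cost):
    """Explore states in order of increasing cost.

    start    : initial state (hashable), e.g. a SAT with level 0 only
    expand   : state -> iterable of successor states (one level added)
    is_final : state -> True if the state covers all windows of interest
    cost     : state -> number, lower is better
    Returns (state, cost) for the first final state reached, or
    (None, inf) if the reachable state space contains no final state.
    """
    tie = count()                              # tie-breaker: states are never compared
    frontier = [(cost(start), next(tie), start)]
    seen = {start}
    while frontier:
        c, _, state = heapq.heappop(frontier)  # cheapest unexplored state
        if is_final(state):
            return state, c
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (cost(nxt), next(tie), nxt))
    return None, float("inf")
```

    The priority queue keeps the candidate set ordered by cost, so each step explores the currently cheapest structure, as described above.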

    The cost associated with each state is used to indicate which structure to choose in terms of running time. We use a theoretical cost model, the expected number of operations, to model the CPU running time. Our model is a simple RAM model: all operations (updates and comparisons) take constant time. The total cost is the sum of the number of updating operations and the expected number of comparison operations, given the sample input series, the window thresholds and the tree structure. Our experiments show that the theoretical cost model models the actual CPU running time well [2].
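    One way to approximate such a cost is to count operations directly on the sample series. The sketch below does that and should be read as a stand-in, not as the model detailed in [2]: it assumes a sum aggregate, nodes at level i that start at multiples of $s_i$ and are clipped at the end of the sample, one unit of cost per node update, one comparison per filter test, and one comparison per subwindow examined in a detailed search. The name `estimated_cost` is hypothetical.

```python
from itertools import accumulate


def estimated_cost(sample, levels, threshold):
    """Count (updates, comparisons) for a structure on a sample series.

    levels[i] = (h_i, s_i) with levels[0] == (1, 1).
    The total cost in the RAM model is updates + comparisons.
    """
    n = len(sample)
    prefix = [0] + list(accumulate(sample))    # prefix[t] == sum(sample[:t])
    updates = n                                # level 0: one update per point
    comparisons = n                            # level 0: one check against f(1)
    for i in range(1, len(levels)):
        h_prev, s_prev = levels[i - 1]
        h, s = levels[i]
        lo, hi = h_prev - s_prev + 2, h - s + 1
        for a in range(0, n, s):               # one node per shift position
            updates += 1
            comparisons += 1                   # the filter test for this node
            node = prefix[min(a + h, n)] - prefix[a]
            if node > threshold(lo):           # detailed search is triggered
                for w in range(lo, hi + 1):
                    # one comparison per subwindow of size w assigned to this node
                    comparisons += max(0, min(a + s, n - w + 1) - a)
    return updates, comparisons
```

    Adding levels raises the update count but tends to shrink each detailed search, which is the trade-off the search in this section is meant to balance.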

    4. Empirical Results

    4.1. Synthetic Data

    Two classes of probabilistic distributions widely used to model many real world applications were chosen to generate the synthetic data: the Poisson distribution and th
