
Approximately Processing Multi-granularity Aggregate Queries over Data Streams

Shouke Qin Weining Qian Aoying Zhou∗

Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China
{skqin,wnqian,ayzhou}@fudan.edu.cn

Abstract

Aggregate monitoring over data streams is attracting more and more attention in the research community due to its broad potential applications. Existing methods suffer from two problems: 1) the aggregate functions that can be monitored are restricted to first-order statistics or functions monotonic with respect to the window size; 2) only a limited number of granularities and time scales can be monitored over a stream, so some interesting patterns might be neglected and users might be misled by an incomplete changing profile of the current data stream. These two problems impede the development of online mining techniques over data streams, and some kind of breakthrough is needed. In this paper, we employ the powerful tool of fractal analysis to enable the monitoring of both monotonic and non-monotonic aggregates on time-changing data streams. The monotony property of aggregate monitoring is revealed, and a monotonic search space is built to decrease the time overhead for accessing the synopsis from O(m) to O(log m), where m is the number of windows to be monitored. With the help of a novel inverted histogram, the statistical summary is compressed to fit in limited main memory, so that high aggregates on windows of any length can be detected accurately and efficiently online. Theoretical analysis shows that the space and time complexity bounds of this method are relatively low, while experimental results prove the applicability and efficiency of the proposed algorithm in different application settings.

1. Introduction

Querying and mining over data streams have been attracting much attention in the database community recently because of the motivation from real-life applications. As is well known, conventional database technologies were invented for the management of collections of static data, but querying and mining dynamic data streams are still challenges.

∗This work was partially completed while the author was visiting the Computer Science Division of the University of California at Berkeley.

The constraints imposed on data stream processing include limited memory consumption, sequential linear scanning, and online accurate result reporting over evolving data streams. One of the most interesting topics is querying and monitoring aggregates with a user-specified threshold, including mining frequent items [12] or itemsets [19, 16], burst detection [21, 20], change detection [5], and aggregate monitoring over time windows [7].

In this paper, we focus on a specific form of query request: given a set of windows wi, i = 1, 2, ..., m, report the current timestamp and all windows wp for which an aggregate over wp, denoted F(wp), is larger than a predefined threshold T(wp). It has wide applications in monitoring network traffic data [8], Gamma-ray data [21], and Web log mining [7].

However, existing techniques suffer from two major problems when applied to real-life applications. First, they are designed for processing aggregations with the distributive property (e.g., count) but not for algebraic aggregates such as average or variance. The significant difference between algebraic and distributive aggregations lies in that the latter is monotonic with respect to the time dimension, while the former is not. However, algebraic aggregate functions are widely used in the analysis of streaming data. For example, the following four models are frequently used to remove the effect of noisy data: Moving Average, S-shaped Moving Average, Exponentially Weighted Moving Average, and Non-Seasonal Holt-Winters [14]. Though some methods have been proposed to maintain variance and k-medians over a stream [3], the cost of finding bursts using these synopses is still O(m). Thus, the method cannot support multi-granularity aggregate queries efficiently, especially when the number of windows is large.

Second, most existing techniques are designed for querying over a single window or a small number of windows. Since users often do not know the most appropriate window size they need to monitor, they tend to monitor a large number of windows simultaneously.


Although processing multiple queries separately is a trivial solution, and aggregate windows of continuous queries can be shared [2], it inevitably results in redundant storage and computation, and finally in poor performance.

To address these two problems, and considering the basic requirements of data stream processing, in this paper we propose a novel approximate method for continuously monitoring a set of aggregate queries with different window size parameters over a data stream.

1.1 Related Work

Monitoring and mining data streams has received considerable attention recently [13, 21, 8, 14, 5, 7, 20]. Kleinberg focuses on modelling and extracting structure from text streams in [13]. Our work is different in that we focus on distributive and algebraic aggregates on different sliding windows. There are also some works on finding changes in data streams [8, 14, 5]. Although their objectives and application backgrounds are different, one common feature they all bear is that they can monitor bursty behavior at only one granularity. The method in [21] uses a Shifted Wavelet Tree (SWT) to detect bursts on multiple windows. In 2005, a more efficient aggregate monitoring method was put forward in [7], which uses a novel index scheme to detect bursty behaviors. However, these two methods can only detect monotonic aggregates, and the maximum window size and the number of monitored windows are both limited, since they are both based on the assumption that for two windows of different sizes, if wi is contained in wj, then F(wj) ≥ F(wi). Actually, this holds only for F that is monotonic with respect to the window size, such as sum or count. For non-monotonic F such as average, the assumption does not hold. An adaptive burst detecting method based on an Inverted Histogram is proposed in [20]; however, no overall bound for the error of burst detection is given there.

Fractal geometry is known to be suitable for describing the essence of the complex shapes and forms of nature [15]. As a tool, fractal techniques have been successfully applied to the modelling and compression of data sets [18, 11, 17, 4]. However, existing fractal techniques are difficult to employ directly in the data stream scenario, because building a fractal model is of polynomial time complexity and multiple scans of the whole data set are necessary.

Our contributions can be summarized as follows:

• We incorporate the tool of fractal analysis into aggregate query processing. Our method thus provides the ability to process not only distributive aggregates but also algebraic ones. We also propose a method for processing data that does not obey the scaling relationship of exact fractal models.

• The monotonic property of the synopsis search space is described. To construct such a monotonic search space for multi-granularity aggregate query processing, a novel approach is presented, which decreases the time overhead of query processing from O(m) to O(log m), where m is the number of windows being monitored.

• An efficient synopsis, called the Inverted Histogram (IH), is employed, and the algorithm for query processing is given. We analyze the storage and computation cost, and prove the error bound of our algorithm. We also complement our analytical results with an extensive experimental study. The results demonstrate that the proposed method can monitor multi-granularity aggregate queries accurately and with high efficiency.

The remainder of this paper is organized as follows. Section 2 gives the preliminary knowledge of fractals and the scaling relationship, and then presents the problem statement. Section 3 introduces the concept of a monotonic search space and presents the method for building such a search space. The details of the aggregate monitoring algorithm, including the maintenance of the IH and the extension of the basic algorithm, are introduced in Section 4. Section 5 shows the results and analyzes the experiments. Finally, Section 6 gives concluding remarks.

2 Preliminaries and Problem Statement

Fractal analysis is a powerful tool for analyzing time series data and data streams. The power-law scaling relationship, a basic characteristic of fractals, can be adopted to handle data streams.

Pieces of a fractal object, when enlarged, are similar to larger pieces or to the whole [15]. If these pieces are identical re-scaled replicas of the others, the fractal is exact. When the similarity is present only in a statistical sense, i.e., it is statistically observed on the given feature at different granularities, the fractal is statistical. Deterministic fractals such as the von Koch curve and the Cantor set are exact; however, most natural objects are statistical fractals.

When a quantitative property, q, is measured on a time scale s, its value depends on s according to the following scaling relationship [15]:

q = p·s^d    (1)

This is called the power-law scaling relationship of q with s. p is a factor of proportionality and d is the scaling exponent. The value of d can easily be determined as the slope of the linear least-squares fit to the data pairs on the plot of log q versus log s:

log q = log p + d log s (2)


Data points for exact fractals line up along the regression slope, whereas those of statistical fractals scatter around it, since the two sides of equation (2) are equal only in distribution.

For discrete time series data, equation (1) can be transformed to q(st) = s^d·p(t), where t is the basic time unit and s (s ∈ R) is the scaling measurement [6]. Simply, d can be determined by d = log(q(st)/p(t)) / log s. The power-law scaling relationship is also obeyed when q is an aggregate function or some second-order statistic. The details are not discussed here due to space limitations.
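As a concrete illustration (a minimal Python sketch with hypothetical data, not part of the original implementation), d can be estimated as the least-squares slope of log q against log s:

    # Estimate the scaling exponent d of equation (2) as the slope of the
    # least-squares line fitted to (log s, log q) pairs.  Hypothetical data.
    import math

    def scaling_exponent(scales, q_values):
        xs = [math.log(s) for s in scales]
        ys = [math.log(q) for q in q_values]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        d = cov / var                      # slope of the regression line
        log_p = mean_y - d * mean_x        # intercept, i.e. log p
        return d, log_p

    # An exact fractal with p = 2 and d = 0.5 gives the slope 0.5 back.
    scales = [1, 2, 4, 8, 16, 32]
    qs = [2 * s ** 0.5 for s in scales]
    print(scaling_exponent(scales, qs))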

A data stream X can be considered as a sequence of points x1, ..., xn with subscripts in ascending order. Each element in the stream can be a value in the range [0..R]. F can be not only a monotonic aggregate with respect to the window size (e.g., sum or count) but also a non-monotonic one such as average. Later it will be seen that F can be extended to even more complicated statistics, such as variance.

Monitoring aggregations on evolving data streams is in fact answering aggregate queries continuously on subsequences (windows) of different lengths. The formal definition of a monitoring task over a data stream is as follows.

Definition 1. Assume that F is an aggregate function, w_li is the latest sliding window of a stream with length li, 1 ≤ li ≤ n, and T(w_li) is the threshold on F(w_li) given by users beforehand. A result is reported on window w_li when xn arrives and F(w_li) ≥ T(w_li).

Here, the definition is similar to the bursts defined in [21, 7]. Compared with previous works, the major differences of ours lie in: 1) F can be an algebraic aggregate, and 2) a large number of windows can be monitored simultaneously.

To reach such a goal, a brute-force approach is to check the synopsis for each window whenever a new point arrives. However, with this approach the cost of answering a query is O(m), where m is the number of windows. Though some previous efforts tried to reduce the cost of maintaining the synopsis, reporting query results from the synopsis also needs substantial cost, which becomes extremely expensive when monitoring a large number of windows.

We propose an alternative method which organizes all the windows into a monotonic search space. Here, a search space is a permutation of all the windows. Our basic idea is to sort the windows so that, when a result is reported on a window, all its predecessors should also be reported as results. We call such a search space a monotonic search space. Thereby, a binary search can be applied on the monotonic search space, and the time complexity for accessing the synopsis is O(log m).

Figure 1. Sketch map for power-law scaling of aggregate function F.

3 Construction of a Monotonic Search Space

3.1 Monotonic Search Spaces over Exact Fractals

Here, it is assumed that the monitored data stream X is an exact fractal. The extensions of the basic method for handling statistical fractals and non-fractal data are given next.

The aggregate result on the unit time scale is denoted by F(1) (F(1) > 0). A line with slope d = log(F(wi)/F(1))/log(wi) can be obtained on stream X. The line can be drawn on a sketch map as Fig. 1 shows. Here, d is the scaling exponent of F. Different aggregate functions F have different d.

Assume that d is known in advance, and that T(wi) and T(wj) are two user-given thresholds on windows wi and wj respectively. We plot the two points (log(wi), log(T(wi))) and (log(wj), log(T(wj))) on the sketch map. Provided that the point (log(wi), log(T(wi))) is above the line with slope d and the point (log(wj), log(T(wj))) is below it, that is to say, the following slope inequalities hold:

log(T(wi)/F(1)) / log(wi) > d = log(F(wi)/F(1)) / log(wi)

log(T(wj)/F(1)) / log(wj) < d = log(F(wj)/F(1)) / log(wj)

then, whenever we check for changes on stream X, there must be a result reported on window wj and no result on window wi. The reason is that log(T(wi)) > log(F(wi)) always holds, since log(T(wi)/F(1)) > log(F(wi)/F(1)) is guaranteed over the whole data stream, and symmetrically log(T(wj)) < log(F(wj)).

To extend this to a more complex setting, assume that d is not known a priori and that m thresholds T(w1), T(w2), ..., T(wm) on m different windows are given by users beforehand. To monitor the aggregates on those windows, we first sort these levels by their ratios log(T(wi)/F(1))/log(wi) in ascending order. This implies that F(wi)/T(wi), i = 1, 2, ..., m, are ordered in ascending order. Hence, the following property of the search space holds:


Property 1. T(w1), T(w2), ..., T(wm) are m thresholds given for m different windows w1, w2, ..., wm, and F(1) is the aggregate result on the unit time scale. Provided that log(T(wi)/F(1))/log(wi) is ordered in ascending order, a window wi being a result to be reported implies that all its predecessors are results to be reported.

With this property, we can sort the search space and greatly decrease the time overhead when answering aggregate monitoring queries. It endows the stream monitor with high scalability for tough monitoring tasks on fluctuating data streams. However, this property is strictly obeyed only by exact fractals. For statistical fractals or non-fractal data, the scaling exponent d in equation (1) is not a constant over the whole data stream. As a result, the order of the thresholds varies from point to point in a data stream. This brings great challenges for using this monotony property in real-life applications.
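As an illustration of this sorting step (a rough Python sketch; the window lengths, thresholds and F(1) below are hypothetical), the search space is simply the list of windows ordered by the ratio log(T(wi)/F(1))/log(wi):

    import math

    def monotonic_order(windows, thresholds, f1):
        # Order the windows by log(T(w)/F(1)) / log(w), ascending (Property 1).
        ratio = lambda w: math.log(thresholds[w] / f1) / math.log(w)
        return sorted(windows, key=ratio)

    # Hypothetical thresholds on windows of lengths 4, 8 and 16, with F(1) = 2.
    thresholds = {4: 10.0, 8: 14.0, 16: 40.0}
    print(monotonic_order([4, 8, 16], thresholds, f1=2.0))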

3.2 Approximating Real-life Data with Fractals

Real-life data may not be exact fractals. For statistical fractals in real-world applications, this property holds in distribution. That is to say, the rule is only obeyed by statistics of the data stream, such as the expectation and variance.

In this case, we sort the thresholds with Property 1; this means that we use exact fractals to approximate statistical fractals. This approximation induces two kinds of errors. One is the error of fractal approximation, denoted by errFA; the other is the error of missorting the thresholds, denoted by errMS.

The fractal approximation error errFA is induced by transforming the real-life data stream into an exact fractal which follows the power-law scaling relationship. Let the aggregate result on the ith-level window wi be F(wi), and let its expectation and standard deviation be E(F(wi)) and D(F(wi)). We assume that (F(wi) − E(F(wi))) / D(F(wi)) ∼ Norm(0, 1). The threshold given by users on F(wi) is denoted by T(wi). Because the fractal transformation uses E(F(wi)) to approximate F(wi) for scaling in the power-law relationship, this does not induce any error when F(wi) < T(wi). However, when F(wi) ≥ T(wi), the change of the aggregate on window wi may be neglected. errFA is the probability of such an error occurring on wi:

errFA = P(F(wi) ≥ T(wi)) = P( (F(wi) − E(F(wi))) / D(F(wi)) ≥ (T(wi) − E(F(wi))) / D(F(wi)) )

Therefore, errFA is bounded once T(wi) is given. Having a different purpose from other approximation techniques such as histograms, wavelets and the FFT (Fast Fourier Transform), the fractal approximation is not used to compress data and reconstruct them later. So errFA is not affected by the distance between the fractal data and the original data, but by the deviation of a given threshold from the expected value of the aggregate result on the corresponding window. The distance error to the original data is accounted for in our synopsis data structure, the IH, which is discussed later.

Figure 2. (a) Power-law scaling relationship on a real-life data stream. (b) Piecewise monotonic search space on a real-life data stream.

The missorting error errMS originates from using an invariant F(1) to sort the thresholds. For exact fractals, the value of F(1) is invariant. However, for statistical fractals F(1) is time-changing. For example, if the monitored aggregate function F is sum, the value of F(1) is equal to (1/n)·Σ(i=1..n) xi, which keeps varying as new points xn of stream X arrive. Thus, constructing a monotonic search space with an invariant F(1), as for exact fractals, does induce errors.

Let us consider the simplified case shown in Fig. 2(a), where two thresholds Ti and Tj are given by users for two different levels of windows. It can be seen that the monotonic search space obtained on d1 does not hold on d2. The reason is that the two scaling relationships have different log(F(1)). So, the order of the monitored windows based on log(T(wi)/F(1))/log(wi) varies across scaling relationships. This is explained in detail as follows.

The value of point B on the vertical axis is the log(F(1)) of d1. Because the slope of the line BTi is smaller than that of BTj, the monotonic search space is ⟨wi, wj⟩. This means that when the aggregate monitoring query is submitted, we first check the time scale wi. If a result is found on wi, the time scale wj is checked in the second step; otherwise, the query processing halts and no result is reported. In this way, ⟨wi, wj⟩ can support the queries on d1. However, when a new point xn+1 comes, the current power-law scaling relationship d2 is obtained. The value of point A on the vertical axis is the log(F(1)) of d2. Since the slope of the line ATi is larger than that of ATj, the monotonic search space of d2 is ⟨wj, wi⟩. Apparently, it conflicts with the monotonic search space of d1.

From the analysis above, a rule ensuring an invariant monotonic search space can be observed and summarized as follows. Assume that the line TjTi intersects the vertical axis at point D.


Figure 3. Building the monotonic search space.

If the point log(F(1)) of any di is above point D on the vertical axis, then di has the same monotonic search space ⟨wi, wj⟩ as d1. Similarly, if the point log(F(1)) of any di is below point D on the vertical axis, then di has the inverted monotonic search space ⟨wj, wi⟩ compared with that of d1.

Property 2. Given two thresholds T(wi) and T(wj), suppose the level of wi is smaller than that of wj, and the line through the points (log(wi), log(T(wi))) and (log(wj), log(T(wj))) intersects the vertical axis at D. Then, if log(F(1)) is above D, the monotonic search space is ⟨wi, wj⟩; otherwise, if log(F(1)) is below D, ⟨wj, wi⟩ is the monotonic search space.
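Property 2 can be checked numerically: D is the intercept, on the vertical (log F) axis, of the line through the two threshold points in log-log space. The sketch below (hypothetical values) computes D and chooses the order accordingly:

    import math

    def search_space_order(wi, Ti, wj, Tj, f1):
        # wi is assumed to be the window with the smaller level (wi < wj).
        xi, yi = math.log(wi), math.log(Ti)
        xj, yj = math.log(wj), math.log(Tj)
        slope = (yj - yi) / (xj - xi)
        D = yi - slope * xi        # intercept of line TiTj with the vertical axis
        return (wi, wj) if math.log(f1) > D else (wj, wi)

    print(search_space_order(4, 10.0, 16, 40.0, f1=2.0))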

The simple rule observed above applies only to two thresholds. Here, it is extended to a more complex case with three thresholds; for cases with more than three thresholds, the extension is straightforward. As shown in Fig. 2(b), the three thresholds are plotted as Ti, Tj and Tk. Pairwise connecting the three points, the lines TiTj, TiTk and TjTk can be obtained. They intersect the vertical axis at points A, B and C, respectively. The vertical axis can be divided, going upwards, into four disjoint intervals (−∞, A], (A, B], (B, C] and (C, +∞). For log(F(1)) falling in (A, B], the relationships ⟨wi, wj⟩, ⟨wk, wi⟩ and ⟨wk, wj⟩ can be obtained based on Property 2. These binary monotonic relationships can be merged into ⟨wk, wi, wj⟩, which is the monotonic search space of the interval (A, B].

Proposition 1. Given m thresholds {T(wi)} (i = 1..m) and corresponding points {(log(wi), log(T(wi)))} (i = 1..m), draw the lines connecting every pair of points; the vertical axis is then divided into several intervals. Each interval has a unique monotonic search space.

No matter what the value of log(F(1)) is for a scaling relationship di, it can be located in one of the intervals. Then, the monotonic search space of the corresponding interval can be used to support aggregate monitoring queries. A novel method to build the intervals and their monotonic search spaces is presented in the next subsection.

3.3 Building the Search Space

Consider the case of three thresholds in Fig. 2(b) as an example. The vertical axis is plotted in Fig. 3. Suppose the line TiTj connecting the pair (Ti, Tj) intersects the log(F(wi)) axis at point A, where wi < wj. The pair (Ti, Tj) is called the divide condition of divide point A. Similar conditions can be deduced for the other two points B and C.

Algorithm 1 buildMonotonicSearchSpace(T(wi), wi)
 1: create a new point Ti;
 2: Ti ← (log(wi), log(T(wi)));
 3: if m ≥ 1 then
 4:   for j = 1 to m do
 5:     the line through Ti and Tj intersects the vertical axis at Di;
 6:     if wi < wj then
 7:       add ⟨wj, wi⟩ to the leftward intervals of Di;
 8:       add ⟨wi, wj⟩ to the rightward intervals of Di;
 9:     else
10:       add ⟨wi, wj⟩ to the leftward intervals of Di;
11:       add ⟨wj, wi⟩ to the rightward intervals of Di;
12:     end if
13:   end for
14: end if
15: increase m by 1;

Binary monotony relationships are generated first and then merged. If log(F(1)) falls in the interval (A, B], and '<' is assigned to the divide conditions of all the leftward divide points of interval (A, B], the binary monotony relationship ⟨wi, wj⟩ is obtained. When '>' is assigned to the divide conditions of all the rightward divide points of interval (A, B], the binary monotony relationships ⟨wk, wi⟩ and ⟨wk, wj⟩ are obtained. Then all three binary monotony relationships can be merged, and ⟨wk, wi, wj⟩ is formed, which is the monotonic search space for the interval (A, B].

The procedure described here can be performed offline, since the order of the thresholds is always determined before online aggregate monitoring starts. It can also be executed online to insert new thresholds given by users on the fly. Therefore, Algorithm 1 is in fact an incremental algorithm.

Algorithm 1 has two input parameters, T(wi) and wi, and no output. Given a new threshold T(wi), it creates a new point Ti in Line 2. Then, at Line 5, it draws a line through points Ti and Tj which intersects the vertical axis at point Di. In the process of updating the monotonic search space of each interval, from Line 4 to Line 13, the algorithm adds the newly generated binary monotonic relationships into each entry.

The time cost of Algorithm 1 for adding a new window to the existing search spaces is O(m), where m is the number of currently monitored windows. The space cost of Algorithm 1 is very small, because the entries are stored on disk and need not be maintained in memory. Furthermore, if the distribution of log(F(1)) can be obtained, we need only store the entries covering the possible values of log(F(1)).

4 Aggregate Monitoring

In this section, an efficient monitoring algorithm is proposed to monitor both monotonic and non-monotonic aggregates over time-changing data streams.


Algorithm 2 monitorAggregatesFalsePositively
 1: locate xn at interval Ik;
 2: A ← retrieve monotonic search space of Ik;
 3: low ← 0, high ← A.size() − 1;
 4: while low ≤ high do
 5:   mid ← (low + high)/2;
 6:   if T(A[mid]) ≤ ComputeAGG(A[mid]) then
 7:     low ← mid + 1;
 8:   else
 9:     if T(A[mid]) > ComputeAGG(A[mid]) then
10:       high ← mid − 1;
11:     end if
12:   end if
13: end while
14: return high;

This algorithm can monitor changes of aggregate results at multiple granularities with a sub-linear search space. To our knowledge, existing research works have to monitor those windows with a linear search space, resulting in poor scalability for stream monitoring. Furthermore, the algorithm can be extended to monitor algebraic aggregates such as variance. The proposed algorithm can also be used as a framework to offer efficient solutions for other kinds of monitoring tasks.

When a new point xn arrives and is inserted into the histogram, Algorithm 2 is invoked to answer the aggregate monitoring query over the current stream. The algorithm has no input parameter. The output is the alarm information, if any; here, we return the windows which are experiencing changes. The algorithm performs a binary search on the monotonic search space, from Line 4 to Line 13, to answer aggregate monitoring queries. ComputeAGG(A[mid]) is the procedure for retrieving F(wi) (wi = A[mid]) from the IH, either false positively or false negatively.
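A rough Python rendering of this binary search is shown below; compute_agg is only a stub standing in for ComputeAGG, and the toy aggregate used in the example is hypothetical:

    def monitor(search_space, thresholds, compute_agg):
        # Binary search over a monotonic search space: return the index of the
        # last window whose aggregate reaches its threshold (-1 if none).
        low, high = 0, len(search_space) - 1
        while low <= high:
            mid = (low + high) // 2
            w = search_space[mid]
            if compute_agg(w) >= thresholds[w]:
                low = mid + 1
            else:
                high = mid - 1
        return high        # windows search_space[0..high] are reported

    # Toy usage: the fake aggregate is simply the window length itself.
    space = [4, 8, 16]
    ths = {4: 3.0, 8: 9.0, 16: 20.0}
    print(monitor(space, ths, compute_agg=lambda w: float(w)))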

Algorithm 2 can explore the search space of the overall stream in O(log m) time, where m is the number of monitored windows. If the concern is only whether some query result occurs when xn comes, rather than on which window the change occurs or how many changes are induced on the overall stream, the time cost can be O(1). We can claim the following theorem, assuming without loss of generality that the space cost of a B-bucket histogram is B.

Theorem 1. Algorithm 2 can monitor aggregates in real time at m granularities over a data stream in O(n log m) time and O(B) space.

This algorithm can be implemented on top of any synopsis structure, provided it can compute aggregate results. The IH is one such synopsis; it is a novel histogram we proposed for detecting aggregation bursts, and its details are presented in the next subsection. The aggregate result F(wi) on the latest i-length window (1 ≤ i ≤ n) can be obtained from the IH. In practice, F(wi) = fss(i), where fss(i) = Σ(j=n−i+1..n) xj, that is, the sum of the last i values of stream X, called the suffix sum. Due to the limited space available for computing aggregates over a data stream, two alternative approaches, namely false positive oriented and false negative oriented, can be employed to approximate the aggregate computation. The former answers an aggregate query on wi with a value not less than the real F(wi), whereas the latter answers it with a value not exceeding the real F(wi). Which of the two is preferred depends on the application.

4.1 Aggregate Estimation Using IH

Now we introduce a novel synopsis, called the Inverted Histogram (IH). It is used to approximately calculate the aggregate value for a given window. Though the existing approximate V-optimal histogram [10] can also do this job, it incurs a heavy workload for monitoring aggregates. The reason is that the absolute error and the size of the buckets constructed later grow larger and larger as the stream proceeds; thus, the errors induced by using the newly created buckets also increase. For our purpose, however, what we expect is that the recent buckets are of high precision; in other words, the width of the recent buckets should be smaller than that of older ones.

Our basic idea is to invert the order of the buckets, so that smaller buckets are used to store the newly arriving points. Fig. 4(b) illustrates the idea [20]. The oldest point x'1 and the latest point x'n of stream x' are in b1 and bB, respectively. In Fig. 4(a), which is also duplicated from [20], x'i = fps(i), where fps(i) is the prefix sum, i.e., fps(i) = Σ(j=1..i) xj. In Fig. 4(b), x'i = fss(i), where fss(i) is the suffix sum, i.e., fss(i) = Σ(j=n−i+1..n) xj. Fig. 4(a) is the bucket series of an approximate V-optimal histogram; therefore, the size and the absolute error of the last bucket grow larger and larger with the passage of time.

As in the approximate V-optimal histogram, the goals for the IH are also a minimal number of buckets and a guaranteed relative error in each bucket. The only difference is that the stream is regarded as x'i = fss(i). The stream x' can be partitioned into B intervals (buckets), (b^a_1, b^b_1), ..., (b^a_B, b^b_B), where bi (1 ≤ i ≤ B) is the i-th bucket, and b^a_i and b^b_i are the minimum and maximum values within bucket bi, respectively. In some cases it is possible that b^a_i = b^b_i. Here, we analyze the bound of the relative error δ in each bucket.


Figure 4. Bucket orders of the approximate V-optimal histogram and the IH: (a) normal bucket order; (b) inverted bucket order.

Algorithm 3 updateIH(xn)
 1: increase bucket number B by 1;
 2: j ← B;
 3: create a new bucket and put xn into it;
 4: for i = B − 1 to 1 do
 5:   add xn to both b^a_i and b^b_i;
 6:   if b^b_i ≤ (1 + δ)·b^a_j then
 7:     b^b_j ← b^b_i;
 8:     add the width of bi to the width of bj;
 9:     delete bi;
10:     decrease bucket number B by 1;
11:   end if
12:   decrease j by 1;
13: end for

The maximum relative error in bi is δ = (b^b_i − b^a_i) / b^a_i. In our approach, each bucket maintains the pair (b^a_i, b^b_i) and the number of values in it, i.e., its width. The buckets are disjoint and together contain x'1..x'n. Therefore, b^a_1 = x'1 and b^b_B = x'n, and b^b_i ≤ (1 + δ)·b^a_i always holds in the process of formation. The construction of the IH is described in Algorithm 3.

The only input parameter of Algorithm 3 is xn, and it has no output. When a new value xn comes, all of the O((log n + log R)/log(1+δ)) buckets are updated, as illustrated by the statements from Line 4 to Line 13. First, Line 3 creates a new bucket for xn and puts it last. Then, Line 5 updates the maximum and minimum values of all the buckets (from the newly created one to the oldest one) by adding xn to them. In the process of updating, from Line 6 to Line 11, the algorithm merges two consecutive buckets bi and bj when b^b_i ≤ (1 + δ)·b^a_j, setting b^b_j = b^b_i. In this way, the maximum relative error in each bucket of the IH can be bounded, and the space and time costs are kept low. So we have the following theorem.
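The following simplified Python sketch mirrors this maintenance step on a list of (low, high, width) buckets holding suffix sums; it keeps the freshest, tightest bucket first and applies the same merge rule, but it is an illustration under these assumptions and not the authors' implementation:

    def update_ih(buckets, x, delta):
        # buckets: list of [lo, hi, width] over suffix sums, newest bucket first.
        for b in buckets:               # every stored suffix sum grows by x
            b[0] += x
            b[1] += x
        buckets.insert(0, [x, x, 1])    # suffix sum of the new length-1 window
        merged = [buckets[0]]
        for lo, hi, width in buckets[1:]:
            last = merged[-1]
            if hi <= (1 + delta) * last[0]:   # merging keeps relative error <= delta
                last[1] = max(last[1], hi)
                last[2] += width
            else:
                merged.append([lo, hi, width])
        return merged

    buckets = []
    for v in [3.0, 1.0, 4.0, 1.0, 5.0]:       # hypothetical stream values
        buckets = update_ih(buckets, v, delta=0.1)
    print(buckets)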

Theorem 2. Algorithm 3 can construct an IH with O((log n + log R)/log(1+δ)) space in O(n·(log n + log R)/log(1+δ)) time, and the relative error in each bucket is at most δ.
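As an illustration with hypothetical parameters (not values from the paper's experiments): for n = 10^8, R = 10^4 and δ = 0.01, the bucket bound evaluates to roughly (ln 10^8 + ln 10^4)/ln 1.01 ≈ (18.4 + 9.2)/0.00995 ≈ 2.8 × 10^3 buckets (the ratio does not depend on the logarithm base), so the histogram easily fits in main memory.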

As mentioned previously, ComputeAGG(wi) is invoked to retrieve F(wi) from the IH with a false positive or false negative guarantee when needed. The relative error of its return value is bounded by δ. The precision of the approximate aggregate computation can be improved by considering the values within a bucket to be equidistant. To guarantee false positive or false negative detection, maxD and minD need to be maintained within each bucket: maxD is the maximum distance between two consecutive points within the bucket, and minD is the minimum such distance. Provided that the real value for wi falls in bucket bj, the false positive aggregate value returned by ComputeAGG(wi) is min(b^b_j, b^a_j + maxD·(i − Σ(k=1..j−1) Wid(bk) − 1)), and the false negative aggregate value returned by ComputeAGG(wi) is b^a_j + minD·(i − Σ(k=1..j−1) Wid(bk) − 1).
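A minimal query routine over the bucket list of the sketch above could look as follows (the maxD/minD refinement just described is omitted, so the answer is simply the bucket's upper or lower bound):

    def compute_agg(buckets, w, false_positive=True):
        # Estimate F(w) = fss(w) from [lo, hi, width] buckets kept newest first.
        covered = 0
        for lo, hi, width in buckets:
            covered += width
            if w <= covered:
                return hi if false_positive else lo
        raise ValueError("window longer than the stream seen so far")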

Up to now, it has been shown that the IH can be used to monitor distributive aggregates. Here, the monitoring task with an algebraic aggregate, say variance, is discussed. The variance in a w-length window is (1/w)·Σ(i=1..w) (xi − μ)^2, where μ = (1/w)·Σ(i=1..w) xi. It can be transformed to (1/w)·Σ(i=1..w) xi^2 − μ^2. As we know, Σ(i=1..w) xi can be obtained from the IH. With a new variable yi = xi^2, Σ(i=1..w) xi^2 can also be computed using an IH with guaranteed error. Consequently, for monitoring variance on wi, we need two histograms: one for computing Σ(i=1..w) xi, the other for computing Σ(i=1..w) xi^2. Both are bounded by the maximum relative error δ.
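The reduction of variance to two sums can be written out directly; with exact suffix sums it is the identity below, and with two IHs each sum is replaced by its δ-bounded estimate (e.g. via compute_agg above):

    def variance_from_sums(sum_x, sum_x2, w):
        # (1/w) * sum(x_i^2) - ((1/w) * sum(x_i))^2 over a w-length window.
        mean = sum_x / w
        return sum_x2 / w - mean * mean

    # For the values 1..5: sum = 15, sum of squares = 55, variance = 2.
    print(variance_from_sums(sum_x=15.0, sum_x2=55.0, w=5))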

4.2 Error Bound Analysis

This section provides a theoretical analysis of the final results of aggregate monitoring, and presents some analysis of the error resulting from using the IH.

Suppose w1 is a monitored window with a threshold T(w1). The process of answering an aggregate monitoring query on window w1 is to detect whether F(w1) is larger than or equal to T(w1). The probability of a false alarm with our method can be evaluated as P(Err) = 1 − (1 − errMS)(1 − errFA)(1 − errIH), where errMS, errFA and errIH are the error probabilities of missorting, fractal approximation and the estimated value of the IH, respectively. They are mutually independent. errMS and errFA have been discussed above. Since the accurate monotonic search space can always be maintained, we have errMS = 0. Thus, P(Err) = errFA + errIH − errFA·errIH, where errFA is bounded by the deviation of a given threshold from the expected value of the aggregate result F(wi) on window wi. That is, errFA = P(x ≥ λ), where x ∼ Norm(0, 1) and λ = (T(wi) − E(F(wi))) / D(F(wi)).

With the false positive or false negative oriented approaches, the error bound errIH for aggregate monitoring query answering can be analyzed as follows.

Assume that x1 = F(w1) and that the estimated value of F(w1) from the IH is x'1.


With the maximum relative error bounded by δ, F(w1) is estimated by the IH to lie in [(1 − δ)·F(w1), (1 + δ)·F(w1)]. For false positive aggregate monitoring, the error probability is

errIH = p · (x1(1 + δ) − T(w1)) / (x1·δ)    (3)

The error probability of the false negative oriented approach is

errIH = (1 − p) · (T(w1) − x1(1 − δ)) / (x1·δ)    (4)

Notice that the parameter p measures the occurrence of bursts in a data stream; a smaller p implies that more bursts occur. Accordingly, the false positive method is suitable for monitoring fluctuating data streams, while the false negative method is suitable for relatively dormant data streams.

Suppose R(T(wi)) is the set of real results of the aggregate monitoring query, which could be obtained exactly with threshold T(wi) on window wi over the stream x; here we do not consider the error errFA induced by the fractal approximation, since errFA and errIH are independent of each other. Similarly, suppose FP(T(wi)) is the set of results of the aggregate monitoring query computed approximately by our false positive oriented method with threshold T(wi) on the window of the same length. The false positive method can provide (1 + ε)-approximate (ε ∈ (0, 1)) aggregate monitoring query processing. This means that, given a threshold T(wi), the number of query results detected by the false positive method with T(wi) is at most 1 + ε times the number of query results detected by the accurate method with T(wi). Such a guarantee can be stated as ||R(T(wi))|| ≤ ||FP(T(wi))|| ≤ (1 + ε)·||R(T(wi))||. If we set T(w1)/x1 = 1 + ε, equation (3) can be transformed to errIH = p·(1 + δ − T(w1)/x1)/δ = p·(δ − ε)/δ.

Lemma 1. Given ε, when δ > ε the false positive method can provide (1 + ε)-approximate aggregate monitoring query processing with false alarm probability at most p·(δ − ε)/δ.

Lemma 2. Given ε, when δ ≤ ε the false positive method can provide (1 + ε)-approximate aggregate monitoring query processing with no false alarms.

Theorem 3. Given ε, λ and δ, the error probability of the proposed method for answering an aggregate monitoring query on window wi with the false positive oriented IH can be bounded by P(x ≥ λ) + max(p·(δ − ε)/δ, 0) − P(x ≥ λ)·max(p·(δ − ε)/δ, 0), where x ∼ Norm(0, 1).

Similar conclusions can be obtained for the false negative oriented method.

5. Performance Evaluation

The algorithms proposed in this paper are implemented with Microsoft Visual C++ Version 6.0. All experiments are conducted on a Windows 2000 platform with a 2.4 GHz CPU and 512 MB of main memory. The algorithms are run on a variety of data sets. Due to space limitations, only the results for two representative data sets are reported here. The two data sets are:

• Network Traffic (Real): The data set BC-Oct89Ext4, called D1 here, is a network traffic tracing data set obtained from the Internet Traffic Archive [1].

• Network Traffic (Synthetic): This synthetic data set is for testing the scalability of our method. It is generated by drawing the burst arrivals of a data stream from a Pareto [9] distribution, as used in simulating network traffic where packets are sent according to ON/OFF periods. The density function of the Pareto distribution is P(x) = a·b^a / x^(a+1), where x ≥ b and a is the shape parameter. The expected burst count, E(x), is a·b/(a − 1). The tuple arrival rate λ1 is driven by an exponential distribution, and the interval λ2 between signals is also generated with an exponential distribution. In this data set, the expected values are E(λ1) = 400 tuples/s and E(λ2) = 500 tuples, with a = 1.5 and b = 1. The size of this time series data set is n = 100,000,000. The whole data set is called D2 here; a rough sampling sketch follows below.
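A rough sketch of sampling from this Pareto distribution (inverse-CDF sampling; the exact generator used for D2 is not reproduced here):

    import random

    def pareto_sample(a, b):
        # Sample from the density P(x) = a * b**a / x**(a + 1), x >= b.
        return b / random.random() ** (1.0 / a)

    # With a = 1.5 and b = 1 the expected burst count is a*b/(a-1) = 3; the
    # empirical mean converges slowly because the tail is heavy.
    random.seed(0)
    samples = [pareto_sample(1.5, 1.0) for _ in range(100000)]
    print(sum(samples) / len(samples))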

In the experiments, two accuracy metrics are used: recall and precision. Recall is the ratio of true alarms raised to the total number of true alarms that should be raised. Precision is the ratio of true alarms raised to the total number of alarms raised.

To set the threshold for F(wi) on an i-length window, we compute the moving F(wi) over some training data. The training data is the foremost 10% of each data set; it forms another time series, called y. The absolute thresholds are set to T(wi) = μy + λ·σy, where μy and σy are the mean and standard deviation of y, respectively. The threshold can be tuned by varying the prefactor λ of the standard deviation. The window lengths are 4, 8, ..., 4·NW time units, where NW is the number of windows; NW varies from 50 to 1000. To avoid latency, the time unit is one data point of the data sets.
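A sketch of this threshold-setting step (hypothetical training values; set_threshold is an illustrative helper, called once per window length):

    import math

    def set_threshold(training, w, lam):
        # T(w) = mean + lam * std of the moving sums F(w) over the training data.
        sums = [sum(training[i:i + w]) for i in range(len(training) - w + 1)]
        mu = sum(sums) / len(sums)
        sigma = math.sqrt(sum((s - mu) ** 2 for s in sums) / len(sums))
        return mu + lam * sigma

    train = [1.0, 2.0, 1.5, 3.0, 2.5, 1.0, 2.0, 4.0, 1.0, 2.0]
    print(set_threshold(train, w=4, lam=7))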

5.1. Experimental results

We evaluate the precision and recall of our aggregate monitoring method. It is supposed that F(wi) is computed false positively by the IH, and F is the distributive aggregate sum. The experiments are conducted on both of the aforementioned data sets. First, the relationship between accuracy and the maximum relative error of the IH is studied. In this experiment, we set the number of monitored windows NW = 100 and the prefactor λ = 7. It can be seen from Fig. 5 that under all settings of δ the precision and recall of our method are at least 99%. As the relative error decreases, the IH provides a better approximation.


Figure 5. Precision and recall for varying maximum relative error δ of the IH, with NW = 100 and λ = 7: (a) on D1 (real); (b) on D2 (synthetic).

Figure 6. Precision and recall for varying prefactor λ, with NW = 100 and δ = 0.01: (a) on D1 (real); (b) on D2 (synthetic).

Therefore, our method can give highly accurate answers to aggregate monitoring queries based on the IH over data streams.

Second, the relationship between accuracy and the aggregation threshold is studied. In this experiment, we set the number of monitored windows NW = 100 and the maximum relative error of the IH δ = 0.01. It can be seen from Fig. 6(a) and Fig. 6(b) that under any setting of λ the precision and recall of our method are always above 99%. As λ increases, our method guarantees better accuracy, since the fractal approximation error errFA decreases.

Third, the relationship between accuracy and the number of monitored windows is studied. In this experiment, we set λ = 7 and δ = 0.01. It can be seen from Fig. 7 that as NW increases, the precision and recall of our method are still above 99%. The accuracy gets better and better as NW increases, because the average size of the monitored windows grows with NW and the fractal approximation is more accurate on large time scales, so the error errFA decreases as NW increases. It can be concluded that our method can monitor a large number of windows with high accuracy over data streams.

Now we evaluate the precision and recall when applying the proposed method to the algebraic aggregate variance. Because of space limitations, only part of the results are presented here. As shown by the first experiment, the accuracy of monitoring variance is also very high when based on the IH. The settings are NW = 100 and λ = 7.

Figure 7. Precision and recall for a varying number of windows, with λ = 7 and δ = 0.01: (a) on D1 (real); (b) on D2 (synthetic).

Figure 8. Precision and recall for monitoring the algebraic aggregate variance: (a) varying δ on D1; (b) varying NW on D2.

It can be seen from Fig. 8(a) that the precision and recall are always above 99% under any setting of δ. The second experiment is conducted with the settings NW = 100 and δ = 0.01; it can be observed that increasing λ reduces the fractal approximation error correspondingly, so the accuracy of monitoring variance improves. In the third experiment, the settings are λ = 7 and δ = 0.01. The accuracy improves as NW increases, as shown in Fig. 8(b). It can be concluded again that our method can monitor a large number of windows over a data stream with high accuracy; moreover, both distributive and algebraic aggregates can be monitored using the proposed method.

Here, some experiments are designed to compare our method with the most recent related work, Stardust [7], in terms of space and time overhead. Although Stardust monitors aggregates exactly while our method does so approximately, our method needs only very little memory and time to answer aggregate monitoring queries if a very small and acceptable error, say 0.1%, is allowed, which is common in real applications. In comparison with the accurate method, our method can run more than tens of times faster, which matters for real-time monitoring. Thanks to the generous help of the author of [7], we obtained the original code package of Stardust, which makes a comprehensive comparison with Stardust possible. Because Stardust is written in Java, for the sake of fairness all our algorithms are re-implemented in Java. All parameter settings are kept the same as introduced before.


Figure 9. Space and time cost compared with Stardust: (a) space comparison on D1; (b) time comparison on D1.

That is to say, we set the basic window length W = 1, the smallest monitored window length K = 4, and the number of monitored windows varies over {50, 100, 200, 400, 600, 800, 1000} for Stardust. For the other settings, F is sum, λ = 7, and δ = 0.01. A special setting for Stardust is the box capacity c = 2.

From Fig. 9(a), it can be seen that the proposed method is space efficient. The space cost of the method is the same as that of the IH, which depends only on the stream length and bears no relationship to the number of monitored windows. However, Stardust has to maintain all the monitored data points and the associated index structure at the same time. It can be seen that the space saved by the IH grows larger and larger as NW and the stream length increase. Thus, the proposed method is better suited to monitoring multi-granularity windows over data streams than previous ones. It is also time efficient, as shown in Fig. 9(b): as the search space grows, the processing time saved by using the monotonic search space increases.

6 Discussion and Conclusions

We present a novel method for monitoring aggregates with multi-granularity windows over a data stream. Fractal analysis is employed to transform the original data stream into an intermediate form. It is thanks to the application of fractal techniques that we can not only process algebraic aggregates but also build a monotonic search space, with which access to the synopsis is dramatically optimized. The inverted histogram is adopted as the core synopsis for the monitoring tasks; it expends limited storage space and computation cost while providing sufficiently accurate monitoring results. Both theoretical and empirical results show the accuracy and efficiency of the proposed method.

Future research could focus on the following three points. First, the proposed method could be extended to automatically determine the most appropriate windows to be monitored. Second, the method could be generalized so that it can handle not only absolute thresholds but also relative ones. Last but not least, the method could be extended to process more complicated queries, such as iceberg queries.

Acknowledgement

The authors would like to thank Dr. Ahmet Bulut for providing the code of Stardust. This work is partially supported by the NSFC under grants No. 60503034, 60496325, and 60496327, and by the Shanghai Rising-Star Program under grant No. 04QMX1404.

References

[1] Internet Traffic Archive. http://ita.ee.lbl.gov/.
[2] A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In Proc. of VLDB, 2004.
[3] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of ACM PODS, 2003.
[4] M. Barnsley. Fractals Everywhere. Academic Press, New York, 1988.
[5] S. Ben-David, J. Gehrke, and D. Kifer. Detecting change in data streams. In Proc. of VLDB, 2004.
[6] P. Borgnat, P. Flandrin, and P. O. Amblard. Stochastic discrete scale invariance. IEEE Signal Processing Letters, 9(6):181-184, June 2002.
[7] A. Bulut and A. K. Singh. A unified framework for monitoring data streams in real time. In Proc. of ICDE, 2005.
[8] G. Cormode and S. Muthukrishnan. What's new: Finding significant differences in network data streams. In Proc. of INFOCOM, 2004.
[9] M. E. Crovella, M. S. Taqqu, and A. Bestavros. Heavy-tailed probability distributions in the World Wide Web. In A Practical Guide to Heavy Tails: Statistical Techniques and Applications, pages 3-26, 1998.
[10] S. Guha, N. Koudas, and K. Shim. Data streams and histograms. In Proc. of STOC, 2001.
[11] J. C. Hart. Fractal image compression and recurrent iterated function systems. IEEE Computer Graphics and Applications, 16(4):25-33, July 1996.
[12] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou. Dynamically maintaining frequent items over a data stream. In Proc. of CIKM, 2003.
[13] J. Kleinberg. Bursty and hierarchical structure in streams. In Proc. of SIGKDD, 2002.
[14] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proc. of IMC, 2003.
[15] B. B. Mandelbrot. The Fractal Geometry of Nature. Freeman, New York, 1982.
[16] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, 2002.
[17] D. S. Mazel and M. H. Hayes. Using iterated function systems to model discrete sequences. IEEE Transactions on Signal Processing, 40(7):1724-1734, July 1992.
[18] X. Wu and D. Barbara. Using fractals to compress real data sets: Is it feasible? In Proc. of SIGKDD, 2003.
[19] J. X. Yu, Z. Chong, H. Lu, and A. Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proc. of VLDB, 2004.
[20] A. Zhou, S. Qin, and W. Qian. Adaptively detecting aggregation bursts in data streams. In Proc. of DASFAA, 2005.
[21] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In Proc. of SIGKDD, 2003.
