sharing-aware outlier analytics over high-volume data...

14
Sharing-Aware Outlier Analytics over High-Volume Data Streams Lei Cao , Jiayuan Wang , Elke A. Rundensteiner IBM T.J. Watson Research Center Yorktown Heights, NY 10598, USA Worcester Polytechnic Institute Worcester, MA 01609, USA [email protected], jwang1|[email protected] ABSTRACT Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter set- tings. The processing of such complex outlier analytics workloads is resource consuming due to the algorithmic complexity of the out- lier mining process. In this work we propose a sharing-aware multi- query execution strategy for outlier detection on data streams called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline com- putation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves mini- mal utilization of both computational and memory resources for the processing of these complex outlier analytics workload. Our exper- imental study demonstrates that SOP consistently outperforms the state-of-art solutions by three orders of magnitude in CPU time, while only consuming 5% of their memory footprint - a clear win- win. Furthermore, SOP is shown to scale to large workloads com- posed of thousands of parameterized queries. Keywords Outlier;Stream;Multi-query 1. INTRODUCTION Motivation. Nowadays, high-speed data-intensive stream moni- toring applications ranging from credit card fraud detection, net- work intrusion prevention, stock investment tactical planning to telephone fraud detection [16] necessitate the extraction of outliers from huge volumes of stream data in a near real-time fashion. In recent years distance-based outlier methods [12, 18, 2, 5] have been widely adopted for the detection of outliers in high volume stream data due to their simplicity and insensitivity to concept drift. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD’16, June 26-July 01, 2016, San Francisco, CA, USA c 2016 ACM. ISBN 978-1-4503-3531-7/16/06. . . $15.00 DOI: http://dx.doi.org/10.1145/2882903.2882920 The distance-based outlier model initially proposed in [12] origi- nates from the basic notion of capturing abnormal phenomena in data which can be traced back to the seminal work by Hawkins [10]. In [10] Hawkins states that outliers can be identified based on assessing the similarity among points in a dataset. More specif- ically an outlier is an object O with fewer than k neighbors in the dataset D, where a neighbor is defined to be any other object in D that is within a distance range r from object O. To discover abnormal phenomena from streaming data using this model of distance-based outliers, a set of input parameters have to be specified by the analyst. For example, when monitoring the po- tential credit fraud in bank transaction streams, analysts may look for unusual transactions whose values in recent days significantly differ from those of the majority of transactions made by peers at similar income levels. To detect such outliers utilizing the distance- based outlier technique, several parameter values have to be appro- priately set. These include the r parameter that defines the notion of significance in transaction value dissimilarity and the k parameter that determines the majority of the peer transactions. In addition sliding window semantics [3] need to be applied to ensure that the outliers are continuously detected based on the most recent portion of the input stream only. Out-of-date information, such as the bank transactions made a long time ago, are typically no longer relevant for the interpretation of the recent outlier detection results. Thus they should be purged from the active stream window. Therefore the sliding window specific parameters, such as the window size and the slide size, also have to be specified by the analyst. A stream outlier detection system may need to handle a large number of such parameterized queries for a variety of reasons. First, multiple analysts monitoring the same input stream may sub- mit their outlier search requests with different parameter settings. In the credit fraud monitoring application, with a number of an- alysts monitoring the same bank transaction stream [16] - each may customize their outlier search requests by tuning their param- eter settings using their personalized interpretation of abnormity. Even a single data analyst may submit multiple queries with dis- tinct parameter settings, because determining apriori the most ef- fective input parameters is difficult - if not impossible - especially when faced with an unknown or widely fluctuating input stream. In streaming environments, the continuous nature of streaming data emphasizes the necessity of supplying the most appropriate input parameters settings early on. Otherwise, the opportunity to accu- rately discover outliers in the segment of the stream gone by may be permanently lost. Furthermore, each analyst may provide their own personalized understanding of how large the “most recent” portion of the data to be considered is essential. Hence they may submit multiple outlier detection requests with different window related 527

Upload: others

Post on 22-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

Sharing-Aware Outlier Analytics over High-Volume DataStreams

Lei Cao†, Jiayuan Wang‡, Elke A. Rundensteiner‡†IBM T.J. Watson Research Center Yorktown Heights, NY 10598, USA

‡Worcester Polytechnic Institute Worcester, MA 01609, [email protected], jwang1|[email protected]

ABSTRACTReal-time analytics of anomalous phenomena on streaming datatypically relies on processing a large variety of continuous outlierdetection requests, each configured with different parameter set-tings. The processing of such complex outlier analytics workloadsis resource consuming due to the algorithmic complexity of the out-lier mining process. In this work we propose a sharing-aware multi-query execution strategy for outlier detection on data streams calledSOP. A key insight of SOP is to transform the problem of handlinga multi-query outlier analytics workload into a single-query skylinecomputation problem. We prove that the output of the skyline com-putation process corresponds to the minimal information neededfor determining the outlier status of any point in the stream. Basedon this new formulation, we design a customized skyline algorithmcalled K-SKY that leverages the domination relationships amongthe streaming data points to minimize the number of data pointsthat must be evaluated for supporting multi-query outlier detection.Based on this K-SKY algorithm, our SOP solution achieves mini-mal utilization of both computational and memory resources for theprocessing of these complex outlier analytics workload. Our exper-imental study demonstrates that SOP consistently outperforms thestate-of-art solutions by three orders of magnitude in CPU time,while only consuming 5% of their memory footprint � a clear win-win. Furthermore, SOP is shown to scale to large workloads com-posed of thousands of parameterized queries.

KeywordsOutlier;Stream;Multi-query

1. INTRODUCTIONMotivation. Nowadays, high-speed data-intensive stream moni-

toring applications ranging from credit card fraud detection, net-work intrusion prevention, stock investment tactical planning totelephone fraud detection [16] necessitate the extraction of outliersfrom huge volumes of stream data in a near real-time fashion.

In recent years distance-based outlier methods [12, 18, 2, 5] havebeen widely adopted for the detection of outliers in high volumestream data due to their simplicity and insensitivity to concept drift.

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected]’16, June 26-July 01, 2016, San Francisco, CA, USAc� 2016 ACM. ISBN 978-1-4503-3531-7/16/06. . . $15.00

DOI: http://dx.doi.org/10.1145/2882903.2882920

The distance-based outlier model initially proposed in [12] origi-nates from the basic notion of capturing abnormal phenomena indata which can be traced back to the seminal work by Hawkins[10]. In [10] Hawkins states that outliers can be identified basedon assessing the similarity among points in a dataset. More specif-ically an outlier is an object O with fewer than k neighbors in thedataset D, where a neighbor is defined to be any other object in Dthat is within a distance range r from object O.

To discover abnormal phenomena from streaming data using thismodel of distance-based outliers, a set of input parameters have tobe specified by the analyst. For example, when monitoring the po-tential credit fraud in bank transaction streams, analysts may lookfor unusual transactions whose values in recent days significantlydiffer from those of the majority of transactions made by peers atsimilar income levels. To detect such outliers utilizing the distance-based outlier technique, several parameter values have to be appro-priately set. These include the r parameter that defines the notion ofsignificance in transaction value dissimilarity and the k parameterthat determines the majority of the peer transactions. In additionsliding window semantics [3] need to be applied to ensure that theoutliers are continuously detected based on the most recent portionof the input stream only. Out-of-date information, such as the banktransactions made a long time ago, are typically no longer relevantfor the interpretation of the recent outlier detection results. Thusthey should be purged from the active stream window. Thereforethe sliding window specific parameters, such as the window sizeand the slide size, also have to be specified by the analyst.

A stream outlier detection system may need to handle a largenumber of such parameterized queries for a variety of reasons.First, multiple analysts monitoring the same input stream may sub-mit their outlier search requests with different parameter settings.In the credit fraud monitoring application, with a number of an-alysts monitoring the same bank transaction stream [16] � eachmay customize their outlier search requests by tuning their param-eter settings using their personalized interpretation of abnormity.Even a single data analyst may submit multiple queries with dis-tinct parameter settings, because determining apriori the most ef-fective input parameters is difficult � if not impossible � especiallywhen faced with an unknown or widely fluctuating input stream. Instreaming environments, the continuous nature of streaming dataemphasizes the necessity of supplying the most appropriate inputparameters settings early on. Otherwise, the opportunity to accu-rately discover outliers in the segment of the stream gone by may bepermanently lost. Furthermore, each analyst may provide their ownpersonalized understanding of how large the “most recent” portionof the data to be considered is essential. Hence they may submitmultiple outlier detection requests with different window related

527

Page 2: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

parameter settings ranging from a short-term minute long view to along-term weekly or even monthly perspective.

For these plethora of reasons, a large number of outlier detec-tion queries with different pattern and window related parametersettings may simultaneously need to be processed by the system.Thus, a stream processing system must be able to accommodate alarge outlier analytics workload composed of hundreds or more ofrequests covering many, if not all, major parameter settings of anoutlier query, and thus striving to capture the most valuable outliersin the stream.

Limitations of the State-of-the-Art. Although efforts have beenmade in developing efficient algorithms for distance-based outlierdetection on data streams [7, 13, 1], these algorithms focus on han-dling one single outlier request with a fixed parameter setting. Thesimultaneous execution of multiple requests with varying patternand window specific parameters remains largely unexplored.

Unfortunately LEAP [7], the state-of-the-art solution for han-dling a single outlier detection request over data streams, is notsufficient when applied to a large analytics workload composed ofhundreds or more of such requests. This is caused by the highdemand on both computational and memory resources to supporteach query in the workload independently. Our experiments con-firm that LEAP indeed cannot meet the real time responsivenessrequirements when handing a large workload (Sec. 6). It takesaround 100 seconds to support a medium size workload (hundredqueries) on a medium speed data stream (100KB per second).

While [13] also focuses on distance-based outlier detection for asingle parameter setting, it also discusses an initial multiple outlierquery solution. However this work is limited in that it only con-siders queries for a variation of pattern-specific parameters. Thewindow-specific parameters are completely overlooked. Even forthis pattern-specific parameter case, [13] continuous to rely on con-ducting routinely expensive range query searches that acquire allpossible neighbors for all queries. Yet as already shown in [7],such range-query based solutions struggle to handle even one sin-gle outlier request when dealing with high volume streams [16].

Challenges. To handle a large workload composed of hundredsor even thousands of outlier requests over data streams in real time,effective processing strategies must be devised that cleverly sharedthe system resources.

This is complicated by the fact that requests with different pa-rameter settings may cause totally different outliers to be identified.First, given a data point p, the evidence needed to prove its outlierstatus, i.e., whether it is an outlier or an inlier, with respect to dis-tinct outlier interpretations (parameter settings) can differ. Clearly,variations in pattern-specific parameters lead to distinct criteria indefining the affinity among data points. Furthermore, due to varia-tions in window-specific parameters even the data subpopulation ofthe data stream each query is posed against can differ dramatically.

Our Proposed SOP Approach. In this work we propose a inno-vative approach, called SOP, that efficiently handles an outlier an-alytics workload composed of a large number of outlier detectionrequests with arbitrary parameter settings, while still guaranteeingthat each data point is processed only once.

As foundation of our solution we make the important observa-tion that a workload composed of multiple outlier requests witharbitrary parameter settings can be correctly answered by directlyutilizing one single skyband query [17] � a generalization of thewell known skyline concept. Better yet we prove that given onedata point, the skyband points discovered by the skyband query arethe minimal yet sufficient evidence required to prove its outlier sta-tus with respect to all outlier queries in the workload.

Second, we design a customized skyband query algorithm called

K-SKY that leverages the unique domination relationships amongthe streaming data points that presents their relative importance tothe detection of outliers. We prove that K-SKY is guaranteed todiscover all skyband points by examining only the minimal numberof data points. Furthermore by prioritizing the processing order ofthe data points in the stream based on their arrival time, K-SKY nat-urally takes advantage of the common data populations among thedata points covered by the active windows of queries with distinctwindow-specific parameters. This way redundant skyband pointcomputations are eliminated. In short our K-SKY algorithm is op-timal in minimizing the number of data points evaluated when pro-cessing an outlier analytics workload.

Based on the K-SKY algorithm our integrated strategy calledSOP (Sharing-Aware Outlier Processing) achieves full sharing ofboth CPU computations and memory utilization for the processingof a workload of outlier queries. Computation-wise, in each ac-tive window, SOP only requires a single pass through the batch ofnew data points. That is, for each point we run one single skybandquery to answer all queries in the workload. Memory-wise, SOPintegrates all skyband points into a single compact data structurecalled LSKY. This assures that only one single copy of the neigh-bors shared across all queries is maintained.

Our experimental studies on real data streams demonstrate thatSOP successfully drives down the CPU costs by over three ordersof magnitude with significantly less memory utilization comparedto [13, 7] for a rich diversity of scenarios. Furthermore, it is theonly known method that scales to huge workloads composed ofthousands of outlier requests.

Contributions. Our contributions include:1) Our SOP framework is the first to tackle the problem of shared

execution of multiple outlier requests with arbitrary pattern andwindow specific parameters in the stream context.

2) The key innovation of SOP is to transform the multi-queryoutlier problem into a single-query skyband problem. The outputof the skyband query is proven to be minimal yet sufficient for de-termining the outlier status of each point for any parameter settingon the workload.

3) Our customized skyband algorithm is tuned to process outlierrequests with diverse parameter settings. K-SKY is proven to beoptimal in the number of points being evaluated.

4) Leveraging the commonality and dominance among the datapopulations, we are able to utilize one specific skyband query tosupport multiple queries with varying window specific parameters.By this full sharing is achieved across the query windows.

5) Our extensive experiments demonstrate that SOP routinelyachieves three orders of magnitude or more speed up over the state-of-the-art methods [13, 7].

2. PROBLEM DEFINITIONDistance-Based Outliers in Sliding Windows. We first review thenotion of distance-based outliers [12]. We use the term data pointor point to refer to a multi-dimensional tuple in a data stream. Thefunction dist(p

i

, pj

) denotes the distance between data points pi

and pj

. Distance-based outlier detection uses a range threshold r todefine the neighbor relationship between any two points p

i

and pj

.

Definition 1. Two points pi

and pj

in a dataset D are said tobe neighbors if dist(p

i

, pj

) r with r a range threshold.

The function nn(pi

, r) represents the number of neighbors adata point p

i

has within range r.

528

Page 3: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

Definition 2. Distance-based Outlier. Given a distancethreshold r and a count threshold k(k > 0 ), a data point p

i

is re-garded as an outlier in dataset D if nn(p

i

, r) < k .

Distance-based outlier detection only works for data sets for whichan appropriate distance function can be defined. While this is straight-forward for numerical attributes, textual or categorical attributeswould typically have to be transformed into numerical attributesusing existing methods in the literature [19].

We use qi

(ri

, ki

) to denote the outlier detection query qi

with ri

and ki

the range and count thresholds respectively.We focus on periodic sliding window semantics as proposed by

CQL [3] and widely used in the literature [13, 7]. Such semanticscan be either time or count-based. In both cases, each query q

i

hasa window size q

i

.win (either a time interval or a tuple count) anda slide size q

i

.slide . For time-based windows each window Wc

of qi

has a starting time Wc

.Tstart

and an ending time Wc

.Tend

= Wc

.Tstart

+ Q .win . Periodically the current window Wc

slides,causing W

c

.Tstart

and Wc

.Tend

to increase by qi

.slide . For count-based windows, a fixed number (count) of data points correspondsto the window size q

i

.win . The window slides after the arrival ofqi

.slide new data points. The arrival time of point p is denoted asp.time . If W

c

.Tstart

p.time <Wc

.Tend

, p falls into Wc

.Outliers will be generated based on the points that fall into the

current window Wc

, also called the population of Wc

. A point pin W

c

might have a different outlier status (outlier or inlier) in thenext window W

c+1 if it is still alive in Wc+1, since each window

has a different population.

Definition 3. Distance-Based Outlier Detection In Sliding Win-dows. Given a stream S, a streaming distance-based outlier detec-tion query q

i

(ri

, ki

,wini

, slidei

), qi

continuously detects and out-puts the outliers in the current window W

c

after the window slidesfrom the previous to the current window W

c

.

Multiple Outlier Detection Optimization. Outlier detection re-quests on the same input stream can have arbitrary settings on allfour parameters r, k, win, and slide. The set of outlier requests q

i

that must be concurrently processed is denoted as query group Q.Each query q

i

in Q is called a member query of Q. Our goal is tominimize both the processing time and the memory space neededto answer all queries in a large outlier analytics workload.

3. VARYING THE DISTANCE-BASED OUT-LIER PARAMETERS

In this section we first introduce our transformation of processinga workload composed of queries with varying r but fixed k param-eters into a skyband query. Then we present our K-SKY algorithmthat supports such skyband query with optimality. Next we extendK-SKY to handle outlier detection queries with arbitrary k and rparameters. In this section we assume all queries share the samesliding window parameters win and slide.

3.1 K-SKY: Varying Parameter - rGiven a query group Q with varying r but fixed k parameters, the

goal is to design an approach that supports all member queries inQ with each point p of data stream S processed only once in eachcurrent window W

c

. The key insight here is that given such a querygroup Q and one data point p in current window W

c

of stream S,the output of one single customized K -skyband query is sufficientyet necessary to determine the outlier status of p with respect to allqueries in Q.

3.1.1 From Multi-Query Outlier Workloads to SingleQuery Skyband Processing

K -skyband query is a generalization of the well known skylineconcept. As defined in [17] a K -skyband query reports all pointsthat are dominated by no more than K points. The case K = 0 cor-responds to a conventional skyline. The key idea underlying thisskyline concept is to define the domination relationship betweenany two data points. As a simple example consider a dataset Dcomposed of n one dimensional data points, namely n distinct val-ues {p

1

, p2

, p3

, ..., pn

}. Assume the domination relationship be-tween any pair of data points p

i

and pj

( 1 i , j n) is definedas p

i

dominates pj

if pi

> pj

. Then the K -skyband (K=2) queryon dataset D returns the top-3 largest points in D: {p

max

, pmax�1

,pmax�2

}. pmax�2

is dominated by the two data points pmax

andpmax�1

, while all other data points in D are dominated by at leastthese three top-3 points of D.

To map our problem of determining the outlier status of a givenpoint p to the K -skyband problem, we have to similarly definethe domination relationship between any pair of data points in thedataset D

Wc , i.e., the population of the current window Wc

. Thekey observation here is that given any two points p

i

and pj

, twokey factors, namely their relative arrival time and the distance tothe point p under evaluation, determine whether p

i

is more impor-tant than p

j

in terms of evaluating the outlier status of p.Let us introduce a query group Q used in the remainder of this

section. Assume we have a query group Q: {q1(r1),q2(r2),...,qm(rm

),qm+1(rm+1), ..., q

n

(rn

)} 1, where rm

represents the r parameterof query q

m

. The r parameter of q1, q2,...,qn

monotonically in-creases, that is, r1 < r2 < ... < r

m

< rm+1 < ... < r

n

.Distance Dimension. In distance-based outlier definition (Def.

2), points in a dataset D are classified either as outliers or inliers.Thus, the process of identifying outliers in D is equivalent to theprocess of finding and eliminating inliers from it. By Def. 2, p isguaranteed to be an inlier once k neighbors are acquired in D. Giventwo points p

i

and pj

, assume dist(pi

, p) < rm

< dist(pj

, p) <rm+1. Then p

i

is the neighbor of p with respect to query subsetQ

i

= {qm

, ...,qn

}, while pj

is the neighbor of p only with respectto query subset Q

j

= {qm+1, ...,q

n

}. Qi

� Qj

. In other words pi

satisfies the neighbor requirement of more queries than pj

. For theevaluation of p, p

i

is more important than pj

, because pi

makes theoutlier status of p closer to be determined with respect to all queriesin Q than p

j

. In this perspective pi

dominates pj

.On the other hand, assume r

m

< dist(pi

, p) < dist(pj

, p) <rm+1. Then p

i

and pj

are both neighbors of p for the same set ofqueries {q

m+1, ..., qn

}. In this scenario pi

and pj

equally affectthe outlier status of p although dist(p

i

, p) 6= dist(pj

, p). Based onthis observation we now are ready to re-define the distance functiondist(p, p

i

) so to normalize the distance between data points. Theoriginal distance function is denoted as dist

o

(p, pi

) instead.

Definition 4. Given a query group Q: {q1

(r1

), q2

(r2

), ..., qm

(rm

),qm+1

(rm+1

), ..., qn

(rn

)} with r1

< r2

< ... < rm

< rm+1

< ... < rn

,dist(p, p

i

) = m + 1 if rm

< disto

(p, pi

) rm+1

for 0 m nwith r0 defined as -1 and r

n+1 defined as 1.

By Def. 4, dist(pi

, p) = dist(pj

, p) if rm

< disto

(pi

, p) <dist

o

(pj

, p) < rm+1. This new normalized distance calculated us-

ing Def. 4 now accurately represents the importance of each datapoint to p.1For the ease of readability, we only list those parameters inthe query notation q

i

(r , k ,win, slide) that vary. In this case(k ,win, slide) would be removed from q

i

, since only parameterr is a variable.

529

Page 4: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

Time Dimension. In the streaming context the presence of thetime dimension further complicates matters. In particular we can-not simply claim that one data point p

i

closer to p impacts the statusof p more than the other points. Instead the arrival time of the datapoints also has to be taken into consideration. A point p

i

that ar-rived later in the window may have a more decisive impact on theoutlier examination process compared to an earlier arriving p

j

evenif p

i

is not closer to p than pj

. This is so because the younger a datapoint p

i

is, the longer its neighbor relationships (if any) with p willpersist into the future.

Domination Relationship. We now define the domination rela-tionship between the pair of points in dataset D

Wc that takes boththe distance and time dimensions into consideration.

Definition 5. Domination Relationship. Given a query groupQ: {q

1

(r1

), q2

(r2

), ..., qm

(rm

), qm+1

(rm+1

), ..., qn

(rn

)} withr1

< r2

< ... < rm

< rm+1

< ... < rn

, point pi

dominates pj

withrespect to point p if: (1) p

i

.time > pj

.time; (2) dist(p, pi

) dist(p, p

j

) (pi

, pj

2DW

c

� p) and p 2DW

c

; (3) dist(p, pi

) n ,with dist() the normalized distance of Q defined in Def. 4.

In other words, given a data point pi

, pi

dominates another pointpj

only if pi

expires later than pj

from window Wc

(Condition 1)and it is not further away from p than p

j

(Condition 2). The thirdcondition in the domination rule filters out any data point p

i

that isnot a neighbor of p for any query in Q. As otherwise this p

i

wouldnever be influencing the outlier status of p.

Based on the domination relationship defined in Def. 5, the out-lier status of p with respect to all queries in Q can now be cor-rectly answered based on the skyband points delivered by one sin-gle (k � 1 )-skyband query denoted as Qs , namely the K -skybandquery with K specified as k-1 2.

Lemma 1. Given a query group Q, for any data point p, theoutput of the skyband query Qs corresponding to Q, denoted as S

p

,is sufficient and necessary to continuously determine the outlierstatus of p with respect to all queries in Q.

While the formal proof of Lemma 1 can be found in AppendixA, below we sketch the key ideas of the proof.

Sufficiency. The sufficiency of this mapping is based on twoobservations, namely the KNN observation and the K-distance ob-servation as explained below.

KNN Observation. First, Qs always returns the k nearest neigh-bors of p as part of the skyband points. The k nearest neighbors ofp denoted as kNN (p) are k points in D

W

c

that do not have largerdistance to p than any other point in D

W

c

. The proof of this obser-vation is intuitive. Given any point p

i

2 kNN (p), at most k � 1points in D

W

c

are closer to p than pi

. By the domination relation-ship defined in Def. 5, at most k � 1 points in D

W

c

dominate pi

.Therefore p

i

is a skyband point of our skyband query Qs .K-distance Observation. Second, once kNN (p) is discovered,

the outlier status of p with respect to each query in Q can be de-termined by examining the distance between p and its kth-nearestneighbor called k -distance(p). If r

m

< k-distance(p) rm+1,

then p is guaranteed to be an outlier for queries {q1,q2,...,qm

} andan inlier for queries {q

m+1,...,qn

}.Justifying this observation is straightforward. If k-distance(p)

rm+1, then all points in kNN (p) are neighbors of p for queries

{qm+1,...,q

n

}. Therefore p is an inlier for such queries. On theother hands, since k-distance(p) > r

m

, p does not have k neighborsfor queries {q1,q2,...,q

m

}. Otherwise the points in kNN (p) wouldnot be the k nearest points to p in D

W

c

. Thus p is an outlier toqueries {q1,q2,...,q

m

}.2For simplicity this notation does not reflect p and k � 1 .

Stream

p1:

<t1,2>

p2:

<t2,3>

p3:

<t3,2>

p4:

<t4,1>

p5:

<t5,1>

p6:

<t6,4>

p7:

<t7,3>

p8:

<t8,2>

p6:

<t6,4>

p7:

<t7,3>

p8:

<t8,2>

p5:

<t5,1>

p10:

<t10,5>

p11:

<t11,6>

p12:

<t12,4>

p9:

<t9,4>

Wc

Wc+1

Figure 1: Sliding window stream

Next we illustrate these two observations with an example.

Example 1. Given a query group Q: {q1(1),q2(2),q3(3)} withthe k parameter set as 3 and the dataset D

Wc composed of pointspi

represented in the arrival time and distance space (< ti

, di

>):{p1 :< t1, 2 >,p2 :< t2, 3 >, p3 :< t3, 2 > ,p4 :< t4, 1 >, p5 :<t5, 1 >, p6 :< t6, 4 >, p7 :< t7, 3 >, p8 :< t8, 2 > } as shownin Fig. 1. Here t

i

indicates the arrival time of pi

(t1 < t2 <...<t8) and d

i

indicates the distance of pi

to p. The (k � 1 )-skybandquery Qs with k = 3 will return {p4 :< t4, 1 >, p5 :< t5, 1 >,< p7 : t7, 3 >, < p8 : t8, 2 >} as the skyband points in windowW

c

. A subset of this result, namely {p4 :< t4, 1 >, p5 :< t5, 1 >,p8 :< t8, 2 >} is the kNN of p. The k-distance of p thus is 2. By thek-distance observation we can correctly derive the outlier status ofp. Namely p is an outlier for q1, while being an inlier for q2 andq3.

Necessity. Note in the above example since the skyband pointp7

:< t7

, 3 > is not in the kNN(p) set of Wc

, p7 is not utilizedto evaluate p in W

c

. However p7 arrived later than p4 and p5 inkNN(p). Potentially it might still benefit the evaluation of p in thefuture windows.

As shown in Fig. 1 when the window slides from Wc

to Wc+1,

< t4

, 1 > will expire. Since all new arrivals pi

({p9, p10, p11,p12}) in W

c+1 are far from p, namely dist(p, pi

) > 3 , now p7will be in kNN(p) = {< t5, 1 >, < t7, 3 >, < t8, 2 >} of W

c+1.As the third nearest neighbor of p, the distance between p7 and pdist(p

7

, p) = 3 will be utilized to determine the outlier status ofp. Now p is an outlier for q1 and q2, while being an inlier only forq3.

3.1.2 The K-SKY AlgorithmAlthough the traditional K-skyband algorithms could be applied

to support our Qs query [17, 21], we now design a customized al-gorithm called K-SKY that more efficiently supports the multipleoutlier detection queries compared to existing algorithms [17, 21].K-SKY encompasses two optimization principles, namely time-aware prioritization and least examination, that leverage the uniqueproperties of the domination relationship among the streaming pointsshown in our outlier detection context. K-SKY is proven to be op-timal in minimizing the number of data points to be evaluated inthe skyband point discovery process.

Time-Aware Prioritization Principle. Given a data point pi

only two attributes are considered in the domination relationship ofour skyband problem, namely the distance to a certain point p andthe arrival time of p

i

as defined in Def. 5. Furthermore, in slid-ing window streams the data points are naturally ordered by theirarrival time. In other words, all data points can effectively beenconsidered to be sorted on their arrival time attribute upon arrival.Therefore K-SKY effectively only needs to consider one attribute(distance to p) in the skyband point discovery process. By the def-inition of the domination relationship, later arrivals will never bedominated by the earlier arrivals. Leveraging this property we pri-oritize the order in which the K-SKY algorithm processes the datapoints. More specifically K-SKY always conducts the search witha later arriving data points first order. By this if one data point is notdominated by more than k points in the distance attribute and thusconsidered to be a skyband point, then it is not necessary to eval-

530

Page 5: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

uate it again. This is so, because it will be guaranteed to never bedominated by other points evaluated later. Thus all skyband pointscan be discovered in one pass over the data set.

Better yet, given a data point pi

with dist(pi

, p) no larger thanthe smallest r value r1 in Q, if p

i

has already been dominated byk points when evaluated, K-SKY can be terminated immediately.This is so because all remaining (unevaluated) points would bedominated by at least these k points that dominate p

i

. ThereforeK-SKY can safely terminate without even examining all points.

Least Examination Principle. Second, in the sliding windowcontext, the K-SKY search is applied in two situations. First, anynew point p that just arrived in the current window W

c

needs K-SKY to figure out its skyband points in the current window. Sec-ond, an existing point p needs K-SKY to update its skyband pointswhen the stream slides to the current window W

c

. In the first situa-tion, for a newly arriving point p, K-SKY has to be conducted fromscratch to search for the needed information of p. Instead in thesecond situation the key observation here is that given the skybandpoints of the window W

c�1, to acquire the skyband data points ofa new window W

c

, only a small fraction of data points in Wc

needto be evaluated, namely the new arrivals and the unexpired skybandpoints of W

c�1.This is so because any existing data point p

i

in Wc

could notpossibly be a skyband point in window W

c

if pi

is not also a sky-band point in W

c�1. If pi

is not listed in the skybandPoints set ofW

c�1, pi

must be dominated by at least k data points pj

in skyband-Points. By the domination rule defined in Def. 5, if p

j

dominatespi

, pj

.time > pi

.time . This indicates pj

would not expire earlierthan p

i

. If pi

is still valid in window Wc

, pj

would also remainvalid. Therefore in W

c

, pi

could not possibly be a skyband point,since it is still dominated by at least k data points.

Algorithm 1 K-SKY(p,Wc

.plist ,p.skyband ,Q)Output: skybandPoints //the k-1-skyband point set1: if p.skyband == NULL then2: W

c

.input = W

c

.plist ; // New point;search from scratch3: else4: expireSkyband(p.skyband)5: W

c

.input = p.skyband + W

c

.pList.new ; //Old point; search in newarrivals and unexpired skyband points

6: end if7: for each pi 2 W

c

.input from W

c

.input.tail to W

c

.plist.head do8: d = dist(p, p

i

);9: if (TRUE == pi.skyEvaluate(d,skybandPoints, Q)) then10: p.updateOutlierStatus(Q);11: else12: if d Q.rmin then13: break;14: end if15: end if16: end for

K-SKY Algorithm. Next we show how K-SKY detects the (k-1)-skyband points in each window W

c

. The skyband is computedevery time when the window moves. In other words, the K-SKYalgorithm is called after we receive a batch of new points based onthe slide size. As shown in Alg. 1, the points of window W

c

arestored in a list structure W

c

.plist . When the streaming data arrives,the later arrivals are appended at the tail of W

c

.plist . Therefore thepoints in W

c

.plist are naturally ordered by their arrival time. If pis a new point of W

c

, then the search has to be conducted fromscratch (Lines 1,2). Otherwise based on our Least Examinationoptimization principle K-SKY only search in the new arrivals andthe unexpired skyband points of p (Lines 3-5).

Then guided by our time-aware prioritization optimization prin-ciple, K-SKY evaluates the data points of the input list W

c

.plistin the order from tail to head, i.e., via a “last come, first served”

order (Line 7). After calculating the distance between p and somedata point p

i

, K-SKY evaluates whether pi

is dominated by at leastk already discovered skyband candidate points. If not, p

i

will beinserted into the candidate point set skybandPoints (Line 9). Other-wise if p

i

is not a skyband point and dist(pi

, p) is not larger thanthe smallest r parameter r

1

in Q, K-SKY terminates (Lines 12,13).This is so because again by our time-aware prioritization principle,the remaining points does not have chance to be skyband points.

Leveraging the time-aware prioritization and least examinationoptimization principles, K-SKY is able to discover all skyband pointsby scanning the data set at most once. In other words, K-SKY isa one pass algorithm. Furthermore, it may terminate without evenseeing all data points. We now show that it is optimal in minimizingthe number of points being evaluated in the execution process.

Lemma 2. Optimality. K-SKY correctly discovers the (k-1)-skyband points in window W

c

by examining only the minimumnumber of data points.

The proof of Lemma 2 can be found in Appendix B.The skyEvaluate Algorithm. The complexity of K-SKY relies

on the number of points being evaluated and on the cost of eval-uating each point, that is, the cost to determine whether a givenpoint p

i

is a skyband point or not. This decision is computedby the subroutine skyEvaluate of K-SKY. Since the number ofpoints examined by K-SKY has already been proven to be mini-mal, the reduction of the second cost per point is now critical forhigh-performance of K-SKY. For this we must design an efficientskyEvaluate algorithm.

Algorithm 2 skyEvaluate(d,skybandPoints, Q)Output: isSkyband; //Boolean: skyband point or not1: layer = skybandPoints.getLayer(d);2: count = 0;3: for i =1; i++; i layer do4: count += skybandPoints.layerCount(i);5: end for6: if count k - 1 then7: skybandPoints.map(pi);8: return true;9: else10: return false;11: end if

First we introduce the core data structure of the K-SKY algo-rithm called LSky. LSky is a layered data structure that stores theskyband points acquired in the execution process of K-SKY. It playsa critical role in assisting skyEvaluate to effectively determinewhether a point p

i

is a skyband point.In LSky, skyband points are organized into a layered two dimen-

sional structure that preserves the order among the skyband pointsin both the distance and the time dimensions. As shown in Fig. 2,the points in each layer have the same distance to point p based onthe normalized distance function in Def. 4. The points in the upperlayer always have a smaller distance to p than the points in lowerlayers. Furthermore, in each layer the points are ordered based ontheir arrival time with the earliest arrival being at the head. By this,skyband points can be quickly expired when the window slides for-ward in time.

As shown in Fig. 2, given a data point pi

, skyEvaluate first cal-culates which layer it belongs to (Line 1). More specifically, givena query group Q: {q

1

(r1

), q2

(r2

) ,..., qm

(rm

), qm+1

(rm+1

) ,...,qn

(rn

) } with r1 < r2 < ... < rm

< rm+1 < ... < r

n

, a point pi

should be mapped to the layer corresponding to rm

(bucket Bm

) ifrm�1 < dist(p, p

i

) rm

. This can be done in logarithmic time inthe number of buckets using a binary search.

531

Page 6: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

Next, skyEvaluate evaluates whether a point pi

is a skybandpoint. Since K-SKY processes data points in the “last come, firstserved” order, a point p

i

to be inserted into LSky is guaranteed tobe dominated by the points falling in the same layer with p

i

and thepoints within its upper layers. If in total there are fewer than k suchpoints in LSky when p

i

is processed, then pi

will be a skybandpoint (Lines 6 - 8). This can be easily determined by explicitlymaintaining the cardinality of each layer (Lines 3 - 5).

Points

Buckets

p1:

<t1,2>

p2:

<t2,3>

p3:

<t3,2>

p4:

<t4,1>

p5:

<t5,1>

p6:

<t6,4>

p7:

<t7,3>

p8:

<t8,2>

B1 d≤ 1

B2 1<d≤2

B3 2<d≤3

Stream

Process order!"# $"%%"&

Figure 2: Skyband Point Search With LSkyNext we utilize an example to demonstrate how K-SKY detects

the skyband points with the assistance of the LSky structure.

Example 2. Given the stream and the queries in Example 1,point p

8

is processed first by K-SKY as shown in Fig. 2. Since 1< dist(p, p

8

) = 2 2 , by Def. 4, p8 is hashed into bucket B2.The next point processed by K-SKY is p7. Correspondingly p7 isinserted into bucket B3 because 2 < dist(p, p

7

) = 3 3 . Pointp6 will be excluded from the skybandPoints set immediately sincedist(p, p

6

) = 4 is greater than the largest r parameter in querygroup Q. Points p5 and p4 instead will be inserted into bucketB1. By Def. 4, p3, p2, and p1 should be hashed into buckets B2,B3, and B2 correspondingly. However all of them are excludedfrom the LSky structure, since they are dominated by at least 3 datapoints. For example when we hash p3 into bucket B2, there arealready 2 points in B1 and 1 point in B2. Therefore p3 is dominatedby 3 points and thus it is not a skyband point. In this example theskyband points are {< t4, 1 >, < t5, 1 >, < t7, 3 >, < t8, 2 >}.

Complexity Analysis. With the assistance of the LSky structurethe overall complexity of K-SKY is O(LB) with L the numberof the points examined in K-SKY and B the number of the lay-ers (buckets) visited when evaluating whether a point is a skybandpoint. We have already proven that K-SKY is optimal in L. Now letus assume that the points are uniformly distributed among all lay-ers of LSky , then on average B equals to |r|

2 , where |r | representsthe number of unique r parameter values in query group Q. Givena window with |W | data points, the complexity of processing thewhole window therefore is O(|W |L |r|

2

).

3.2 Handling Various K and R ParametersWe now relax our problem to consider varying not only k but

also r parameters. One simple approach to handle a set of outlierdetection queries with arbitrary pattern related parameters k and rwould be to divide this workload into groups, each of which con-tains queries with the identical k parameter value. This then wouldsimplify our problem into a multi-skyband query problem with onlythe k parameter varying. Intuitively our problem then could be han-dled by directly applying K-SKY on each group of queries. How-ever this solution requires the independent identification and main-tenance of the skyband points for each group of queries. Since alarge number of skyband points are likely to be shared across theseskyband queries, this naive solution inevitably leads to significantwastage of CPU and memory resources. We now tackle this short-coming.

3.2.1 Sharing-Aware Multi-Skyband SolutionNext we propose a sharing-aware solution that efficiently solves

this multi-skyband query problem. By maintaining the skybandpoints in one integrated LSky structure, given a point p

i

, only onesingle skyband point evaluation operation is required to correctlyanswer all skyband queries. This way we assure that multiple sky-band queries are supported by K-SKY, while still guaranteeing thateach data point is evaluated exactly only once.

Given a query group Q, Q is partitioned into sub-groups Qj

:{(r

1

, kj

), (r2

, kj

), ..., (rn

, kj

)} (1 j max , k1 <k2 ... <kmax

,r1 < r2 ... < r

n

). The member queries in each sub-group Qj

sharethe same k parameter value (k

j

). Therefore each Qj

correspondsto one skyband query Qs

j

.The key idea here is that our K-SKY algorithm can handle any

number of skyband queries with distinct k parameter values withonly some slight adjustment in the criteria used to determine whethera point p

i

is a skyband point of at least one Qs

j

.

Definition 6. Skyband Point Rule. pi

is a skyband point if:(1) p

i

is hashed into some bucket Bm

;

(2) k0

=mP

j=1

| Bj

|< kmax

; and

(3) dist(p,pi

) max{rn

ofQj

| 8kj

> k0}.

The first two conditions in Def. 6 correspond to the examinationrule of the single query case except for replacing the k parameterof the single query with k

max

(the largest k parameter value in Q).However not all points satisfying these two conditions would be inthe skyband point set. Now p

i

is dominated by k0

points. By thedefinition of k-skyband query, p

i

would not be a skyband point ofquery Qs

j

unless k0

is smaller than the k parameter kj

of queries inQj (the query sub-group corresponding to Qs

j

). Furthermore, anypoint p

i

will be discarded by query Qs

j

if dist(p, pi

) is larger thanrn

of Qj by the domination relationship defined in Def. 5. Theabove two conditions are captured by Condition 3 in Def. 6.

Optimality. Based on the discussion of our time-aware priori-tization optimization principle it can be easily shown that K-SKYdiscovers all skyband points of multiple skyband queries in onepass over the data set. Here we provide the intuition. The process-ing order of K-SKY guarantees that the points added to skyband-Points during the execution of K-SKY would not be replaced later.Therefore similar to the case of processing a single skyband query,K-SKY correctly discovers the (k-1)-skyband points for all sky-band queries with only the minimum number of points evaluated.

Complexity Analysis. Similar to handling a single skybandquery, the complexity of K-SKY handling multiple skyband queriesis determined by the number of the points being evaluated (L) andthe number of the layers (B) that exist in LSky . Given a windowwith |W | data points, the complexity of processing the whole win-dow therefore is O(|W |LB).

3.2.2 Outlier Detection With K-SKYAfter acquiring the skyband points, these points then can be uti-

lized to determine the outlier status of p with respect to each mem-ber query in Q. For example, this can be done by first calculatingthe k -distance of p with respect to each query sub-group Q

j

andthen applying the k-distance observation (Sec. 3.1.1). However thekey observation here is that to determine the status of p, this extraprocess is superfluous. In fact, we observe that the outlier status ofp can be naturally derived as part of the skyband point discoveryprocess as explained below.

Inlier Rule. Suppose we have a query sub-group Qj

: {q1

(r1

, kj

),q2

(r2

, kj

), ..., qm

(rm

, kj

), qm+1

(rm+1

, kj

), ..., qn

(rn

, kj

)}. When

532

Page 7: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

evaluating whether pi

is a skyband point, if point pi

is found tobe dominated by k

j

� 1 points and mapped into bucket Bm

ofLSky, then p is guaranteed to be an inlier for a subset of queriesin Q

j

: {qm

, qm+1

, ..., qn

}. This is so because all points dominat-ing p

i

are as close as pi

to p. Since pi

is mapped to bucket Bm

,dist(p

i

, p) rm

. Therefore pi

along with all points dominating pi

(in total kj

points) are neighbors of p for {qm

, qm+1

, ..., qn

}. Bythe outlier definition in Def. 2, p is thus an inlier for these queries.

As shown inliers are naturally recognized during the process ofevaluating whether a point is a skyband point without introducingany extra overhead. This logic can be seamlessly applied in K-SKY(see Line 11 in Alg. 1) to mark p as inlier for the correspondingqueries. Eventually p will be reported as outlier for those queriesthat do not mark p as inlier after K-SKY terminates.Safe Inlier in Sliding Stream Windows. Furthermore, given apoint p its outlier status might not always need to be evaluated ineach window and against every query in Q. Potentially skybandpoints discovered in the current window might provide sufficientevidence to prove that p is an inlier during its entire remaining lifefor a particular subset of queries in Q, regardless of the character-istics of the future incoming stream. In this case, we would namep a guaranteed safe inlier with respect to a query subset Q

safe

ofquery group Q. This property arises due to the time order relation-ship among stream data points.

Safe Inlier Condition. We observe that p is guaranteed to be asafe inlier if the point p

i

that triggers the above inlier rule arriveslater than p. By the domination relationship definition in Def. 5, foreach of the k

j

� 1 points pj

that dominate pi

, pj

.time > pi

.time> p.time . In other words, all k

j

neighbors of p would have arrivedlater than p and in turn would not expire before p. Therefore theneighbor relationship between p and p

i

persists during the entirelife of p. p is thus a safe inlier.

Evaluating the above safe inlier condition in K-SKY is straight-forward. Namely when monitors the satisfaction of the inlier rule(Line 11 in Alg. 1), we also compare the arrival order of p

i

and p.Once p is determined to be a safe inlier, it is no longer neces-

sary to evaluate p for Qsafe

: {qm

, qm+1

, ..., qn

} in any future win-dow. Thus the discovery of safe inliers can significantly improvethe CPU and memory efficiency of K-SKY.

Next we demonstrate with an example how K-SKY determineswhether a given point p is an outlier with respect to query group Q.

Example 3. Given two query groups QG1 and QG2 in Fig. 3,on stream in Fig. 4 we demonstrate how K-SKY supports outlierdetection queries with varying k and r parameters. As shown inFig. 4, p8 is processed first and hashed to bucket B2 of LSky. p7is processed next and inserted into bucket B3. p7 is dominatedby one point in B2. In other words, p7 is dominated by k

1

� 1points, where k1 is the k parameter of QG1. By the inlier rulethis may cause p to be recognized as inlier for some queries inQG1. By comparing dist(p, p

7

) against the r parameters in QG1,p is confirmed to be inlier for queries < k

1

, r3

> and < k1

, r4

>,since dist(p, p

7

) r3

< r4

. Then K-SKY proceeds to hash p6into bucket B4. p6 is dominated by two points p7 and p8. Thistriggers the inlier status check for queries in QG2 (k2 = 3). Sincedist(p, p

6

) r4

, p is inlier for query < k2

, r4

>. Next p5 is pro-cessed and inserted into bucket B1. When K-SKY processes p4, p4is dominated by 2 (k2�1) points. Furthermore, dist(p, p

4

) rmin

of QG2 (r2 = 2). By the termination condition of K-SKY, QG2 (k2= 3) is terminated now, since all queries in QG2 classify p as inlier.Next p3 will be excluded from LSky, since p3 (in B3) is dominatedby four points and therefore is not a skyband point for any querygroup. After p2 is evaluated, there are two points p2 and p5 in B1

whose distance to p is not larger than rmin

= r1 = 1 of QG1 (k1

r

kr1= 1 r2 = 2 r3=3 r4=4

k1 = 2 X X X

k2 = 3 X X X

QG1

QG2

Figure 3: Queries with varying k and r parameters

Points

Buckets

p1:

<t1,2>

p2:

<t2,1>

p3:

<t3,3>

p4:

<t4,2>

p5:

<t5,1>

p6:

<t6,4>

p7:

<t7,3>

p8:

<t8,2>

B1 d≤ 1

B2 1<d≤2

B3 2<d≤3

B4 3<d≤4

Stream

<k1,r3><k1,r4>

inlier

<k2,r4>

inlier

!"#$

%&'()*+%&

!",$

%&'()*+%&

!"#$%&'(")

*+"#',,

Figure 4: K-SKY for multiple queries

= 2). This satisfies the termination condition of QG1. In turn p isclassified as inlier by all queries in QG1. This leads to the termi-nation of the outlier status evaluation process for point p, becauseboth query groups have been completed. The earliest arrival p1 isnot evaluated.

The Overall Outlier Detection Approach. The overall processof utilizing the K-SKY algorithm to continuously detect outliersfrom the sliding window stream is shown in Alg. 3.

Algorithm 3 detectOutlier(Wc

.plist ,Q)Output: outliers; //the outlier sets1: for each p 2 W

c

.plist do2: if (IsSafeInlier(p) == TRUE) then3: break;4: else5: K-SKY(p,W

c

.plist ,p.skyband ,Q);6: Qin = markInlierStatus(p, Q);7: markSafeInlier(p, Q);8: if Q - Qin 6= ; then9: insertOutlier(outliers,p,Q - Qin);10: end if11: end if12: end for

Given a point p in the current window Wc

, Alg. 3 first checkswhether p is a safe inlier (Line 2). The K-SKY algorithm will onlyexecute on the points that are not marked as safe inliers (Lines 4,5). Then based on the output of K-SKY, we mark p as inlier forthe corresponding queries Q

in

(Line 6). The safe inlier status of pwill also be updated (Line 7). Finally p is inserted into the outputset outliers if not all queries in Q classify p as inlier (Lines 8 - 10).Each element in the outlier set records one point p along with themember queries q

i

2 Q - Qin

that classify p as outlier.

4. VARYING SLIDING WINDOW PARAM-ETERS

Next, we study the case that the window parameters can vary.The key observation here is that such multiple queries can be sup-ported by utilizing one single customized skyband query. There-fore full sharing of both CPU and memory resources is achievedover sliding windows.

533

Page 8: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

4.1 Varying the Window Parameter - WinHere we first examine the scenario when the window sizes vary,

while the slide size remains stable. Therefore all queries slide to anew window at the same time. In other words, they are synchro-nized. All queries require output at exactly the same moment, i.e.,at time W

c

.end in Fig. 5. This observation leads to an impor-tant characteristic. Given a query group Q with member querieshaving the same slide size but arbitrary window sizes, Q can besupported with one skyband query with respect to q

max

denotedas qs

max

, namely the member query with largest window in Q asin Fig. 5. Intuitively this is so because the largest window coversall smaller windows. Therefore skyband points discovered in thelargest window can be utilized to answer all queries in the group.

Therefore by employing the K-SKY algorithm and collecting theskyband points for this special skyband query qs

max

, this outlierquery group Q can be correctly answered with each point p in thedata stream S evaluated only once in each window that contains p.

Wc.end

q1.Wc.start

q2.Wc.start

q3.Wc.start

qmax.Wc.start ! !!!!

outlierinlier

!

Succeeding neighbor

common points

Figure 5: Queries with varying window sides

As shown in Sec. 3.1.2, K-SKY gives more preference to thepoints arriving later than the points arriving earlier. That is, K-SKY always processes the later arrivals first. On the other hand thelater arrivals in the stream happen to be the common points amongthe data populations covered by the current windows of differentqueries (Fig. 5). Therefore K-SKY naturally leverages the datacommonality among the windows of distinct queries. Redundantcomputations during the skyband point discovery process are elim-inated.

Outlier Status Evaluation. After K-SKY terminates, as shownin Sec. 3.1.2, the outlier status of p 2 q

max

.DW

c

with respect toqmax

has already been determined. If qmax

marks p as outlier,then p is guaranteed to be an outlier for any other query q

i

in Q .Therefore it is not necessary to evaluate the outlier status of q

i

any-more. However this is not the case if p instead is marked as aninlier for q

max

. This is so because the neighbors of a particular qi

are only a subset of the neighbor set of qmax

. Even if qmax

hasacquired enough neighbors to prove the inlier status of p, this doesnot guarantee that p has enough neighbors for other queries withsmaller windows. Therefore an extra outlier status evaluation stepis necessary for other member queries q

i

in Q besides qmax

. For-tunately, as we demonstrate below, to determine the outlier statusof p with respect to the member queries in Q, it is not necessary toexamine all points in qs

max

.skyband .

Lemma 3. Given a query group Q: {q1

, q2

, ..., qm

, qm+1

, ...,,qmax

} (q1

.win < q2

.win < ... < qmax

.win), p is an outlier forqueries {q

1

, q2

, ..., qm

}, if qm+1

.Wc

.start < pold

.time < qm

.Wc

.start , where p

old

is the oldest point in qs

max

.skyband .

The formal proof of Lemma 3 can be found in Appendix C.

By Lemma 3 when p is an inlier of qmax

, to determine the outlierstatus of p with respect to all other queries in Q we only need toevaluate one single skyband point p

old

, namely the last skybandpoint acquired exactly when K-SKY terminates. This is achievedby locating the query q

m

with largest window size whose currentwindow W

c

has not started yet at the time when pold

arrives. Asshown in Fig. 5, queries q1 and q2 are outliers, because the oldestneighbor of p, namely p

old

arrived earlier than the start time of thewindows of q1 and q2.

Let us assume the queries in Q are ordered by their window sizes.In that case, the delimiter query q

m

can be located in O(log | Q |)time with a binary search style algorithm.

Safe Inlier. A point p declared as inlier by query qmax

is notnecessarily an inlier for all queries in Q. However this is not thecase when p is a safe inlier. Once p is confirmed as a safe inlier byqmax

, p is guaranteed to be a safe inlier for all queries in Q.Since all queries in Q have the same slide size, their current win-

dow Wc

ends at the same time point. Therefore all queries in Qshare the same succeeding points of p in W

c

, namely the pointsarriving later than p, but earlier than the end of the window as de-picted in Fig. 5. This indicates that the k succeeding neighbors ofp discovered by q

max

are shared by all queries. Hence p is a safeinlier with respect to all queries in Q.

Based on this Safe-For-All property, once p is determined by K-SKY to be a safe inlier of q

max

, p can also be declared to be a safeinlier for all queries without requiring any further evaluation. It isthen safe to exclude p from further evaluation in any future window.Therefore significant CPU and memory resources are saved.

In summary, we conclude that we only need to detect and main-tain the skyband points for one single skyband query with respectto the outlier query with the largest window size. This then is suf-ficient to answer all outlier queries in the query group. Clearly, fullsharing is achieved.

4.2 Varying the Slide ParameterNext, we consider the case where all queries have the same win-

dow size, while their slide sizes vary. Unlike the previous varyingwindow size case, these queries are not synchronized. That is, theirwindows move at a different pace. Therefore no stable relationshipholds across the data populations covered by the active windowswith respect to different queries. In other words there is no suchquery whose active window continuously contains the windows ofother queries. Therefore the above strategy supporting queries withvarious window sizes does not handle this case. However indepen-dently generating output for this set of queries at different momentsis not practical for large volume streams.

To solve this problem, given a query group Q, we build a singleswift query q

sft

that correctly answers all member queries of Q.Q

sft

has the same window size as all member queries in Q , whileits slide size is set as the greatest common divisor on the slide sizesof all the queries in Q.

Intuitively by definition of the greatest common divisor, 8qi

2Q, we have q

i

.slide mod qsft

= 0. Therefore at any time tj

whenqi

2 Q produces an outlier result qi

.outlier , qsft

would also beproducing result q

sft

.outlier . Furthermore, since qi

.win = qsft

.win ,the points covered by the window of q

i

and qsft

would be identicalat t

j

. Therefore at any tj

qi

.outlier = qsft

.outlier . Hence Qsft

issufficient to represent all queries in Q.

Therefore a query group Q with varying slide sides can be sup-ported by one special skyband query with respect to this specialoutlier query q

sft

. It is straightforward to determine at runtimewhen to output the outlier detection results and what query the out-

534

Page 9: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

put corresponds to by tracking for each query when the windowslides.

Safe Inlier. Although potentially qsft

slides its window morefrequently than any q

i

in Q, this swift query solution does wastesneither CPU nor memory resources. This is because q

sft

is able todiscover safe inliers and to terminate the outlier detection processat the earliest possible moment. This observation relies on the SafeFor All property of q

sft

similar to qmax

in the varying window sizecase. Namely given a point p, if p is recognized as a safe inlier forthe swift query q

sft

, then p is a safe inlier for all qi

2 Q. Next webriefly justify this observation.

We use succ(p, q) to denotes the points arriving later than p inthe current window W

c

of query q. The safe-for-all property fol-lows immediately from the fact that in any future window succ(p, q

sft

)

is a subset of succ(p, qi

) for any query qi

2 Q. This is so becausein any future window when query q

i

is scheduled to produce outlierstatus for p, all points succeeding to p in the current window of q

sft

will not expire (since p has not expired). Furthermore, qi

will alsoget some additional new points into the future window.

Since qsft

is potentially scheduled more frequently than any queryin Q, the safe inliers will be discovered quicker.q

sft

is able to dis-cover and prune the safe inliers earlier than any query in Q. There-fore CPU and memory resources are saved.

In conclusion, this swift query solution achieves full sharing byutilizing only one single skyband query to answer all member queriesin Q. Furthermore safe inliers are also discovered and discardedearlier than any actual member query, leading to additional savingin CPU and memory resources.

4.3 Varying Both Win and Slide ParametersWe now describe our solution for the case when both window

parameters, namely win and slide, vary. This solution is a straight-forward combination of the techniques introduced in the last twosections. In particular, we simply build one single swift query thathas the largest window size among all member queries and its slidesize as the greatest common divisor of the slide sizes of all memberqueries. A specific skyband query with respect to this single swiftquery will then be employed to collect skyband points, namely theevidence to prove the outlier status of a given point p. Similar tothe varying slide size case the timing when each query is requiredto produce output is determined at runtime (see Sec. 4.2). ThenLemma 3 introduced in Sec. 4.1 is applied to decide the outlierstatus of each point for the queries requiring output based on theresults of the swift skyband query. In short, this case of arbitrarywindow and slide sizes can be regarded as an arbitrary window sizecase with a fixed slide size whose value is the greatest common di-visor of the slide sizes of all member queries.

5. VARYING ALL PARAMETER SETTINGSFinally, we consider the most general case with arbitrary pat-

tern and window-specific parameters. Although sharing among agroup of totally arbitrary queries appears hard at first sight, wenow demonstrate that this problem can be tackled utilizing the sky-band query technique. This is possible, because as shown in Sec.4.3, the skyband query technique designed for processing the out-lier analytics workloads with varying pattern-specific parameterscan be leveraged to answer multiple queries with arbitrary window-specific parameters.

SOP Outlier Detection Framework. As depicted in Fig. 6,SOP first employs a query parser to divide the queries in a querygroup Q into sub-groups Q

i

based on their k parameters. Querieswith the same k parameter are grouped into one sub-group Q

i

. Thequeries in each sub-group Q

i

are then sorted based on their r pa-

Figure 6: SOP Frameworkrameters. The queries with same r parameters are further sortedbased on their window sizes. Then the query parser will create oneskyband query Qs

i

for each outlier query sub-group Qi

. Its windowsize is set as the the largest window size among the member queriesin Q

i

. Its slide size is then set as the greatest common divisor ofthe slide sizes of the member queries.

After the query parser transforms the outlier detection queriesinto the skyband queries, the K-SKY algorithm for multiple sky-band queries introduced in Sec. 3.2 will be applied to detect theskyband points. Then the outlier status evaluator determines theoutlier status of each data point with respect to the outlier queriesusing the inlier rule introduced in Sec. 3.2.2.

Similar to the varying window sizes case if a data point p is clas-sified as an outlier for the queries in some sub-group Q

i

, then p isguaranteed to be an outlier for all queries in Q

i

no matter what theirwindow sizes are. On the other hand if p is declared to be an inlierfor some queries, Lemma 3 has to be applied to evaluate whether pis indeed an inlier for these queries by checking their window sizesas shown in Sec. 4.1.

Once the outlier status of p is determined for certain queries, thequery scheduler determines whether it is time to output the outliersfor these queries based on their slide sizes (per Sec. 4.2).

Conclusion. Computation-wise, SOP only requires a single passthrough new data points, each collecting the minimum evidence toprove its outlier status with respect to all queries. Memory-wise,the evidence which proves the outlier status of each data point withrespect to multiple queries is maintained only once. In short, SOPachieves full sharing for multiple distance-based outlier queriesover sliding windows in terms of both CPU and memory resources.

6. PERFORMANCE EVALUATION

6.1 Experimental MethodologyWe conducted experiments on a PC with 3.4G HZ Intel i7 pro-

cessor and 6GB memory, running Windows 7 OS. All algorithmsare implemented in JAVA on HP CHAOS stream engine [8].

Real Data Sources. We use the Stock Trading Traces Data(STT) [11]. It has one million transaction records throughout thetrading hours of one day. All data has the same format of name,transId, time, volume, price, and type.

Synthetic Data. We also implement a data generator to cre-ate a dataset containing 100M points. This dataset is composed ofGaussian distributed data points as inlier candidates and uniformdistributed ones as outliers. The outliers are randomly distributedin each time segment of the data stream.

Alternative Algorithms. We compare our proposed SOP al-gorithm with the two state-of-the-art solutions from the literature[13, 7]. Since MCOD [13] does not support variations in window-specific parameters, we have extended MCOD by inserting our

535

Page 10: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

10 100 500 1000101

102

103

104

105

106

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

10 100 500 1000200

300

400

500

600

700

800

900

1000

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (log scale)

Figure 7: Varying r values for queries on synthetic dataset

10 100 500 1000101

102

103

104

105

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

10 100 500 10000

100

200

300

400

500

600

700

800

900

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (log scale)

Figure 8: Varying k values for queries on synthetic dataset

window-specific techniques into MCOD. We now use this enhancedalgorithm to compare against SOP. In addition, we also compareour sharing strategy against the state-of-the-art single query strat-egy LEAP [7]. Multiple queries are supported by applying LEAPindependently to process each query in the query group.

Metrics. We measure two metrics common for stream systems,namely the average processing time (CPU time) per window andthe peak memory consumption (MEM). The CPU time per win-dow corresponds to the total amount of system time resources usedto process the queries on the data in one window. The consumedmemory metric corresponds to the memory required to store the in-formation for each active object (i.e. the skyband points) and theoutliers of all queries in the current window. All results are col-lected and calculated at the unit of one window at a time. Thenthey are averaged over all windows processed in the given experi-ments. All experiments are reported using the count-based window,with time-based window processing achieving similar results.

We also conduct scalability tests to validate the performance ofthe proposed algorithms with an increasing number of queries inthe workload.

All in all our study covers important combinations of the fourquery parameters. They range from varying one specific parameteronly at a time to the more general cases of varying all 4 of themamong the queries populating the workload as shown in Table 1.The varying ranges of the parameters are listed in Table 2.

Workload Pattern WindowR K W S

(A) arbitrary fixed fixed fixed(B) fixed arbitrary fixed fixed(C) arbitrary arbitrary fixed fixed(D) fixed fixed arbitrary fixed(E) fixed fixed fixed arbitrary(F) fixed fixed arbitrary arbitrary(G) arbitrary arbitrary arbitrary arbitrary

Table 1: Combinations of different workloads

Type Name ValuePattern K [30,1500)

R [200,2000)Window W [1Ks,500Ks)

S [50s,50Ks)

Table 2: The ranges of the parameters

6.2 Varying Pattern Specific ParametersOur general methodology is to prepare four workloads with 10,

100, 500, 1000 queries respectively by randomly choosing the val-ues of the pattern-specific parameters in a range for each query,while fixing the window-specific parameters. The synthetic dataset

0.010.1 1 5x 104

101

102

103

104

105

106

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

0.010.1 1 5x 104

0

500

1000

1500

2000

2500

3000

3500

4000

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (log scale)

Figure 13: Varying K, R, W and S values on synthetic dataset

is utilized in this set of experiments to make sure the outlier rate issmall (< 5%) when varying the k and r parameters.

Arbitrary R Case. In the first experiment, we evaluate the per-formance of our SOP compared with the state-of-the-art MCOD[13] and LEAP [7] when only varying parameter r. We fix the win-dow size to 10K, slide size to 0.5K and k parameter value to 30,while r is randomly selected in the range from 200 to 2000.

As shown in Fig. 7(a), SOP significantly outperforms MCODand LEAP up to 3 orders of magnitude in CPU time. By mappingthe multiple outlier query problem to the skyband query problem,SOP only needs to collect minimum information to prove the out-lier status of each data point with respect to all queries. Instead,MCOD relies on routinely conducting a range query to detect out-liers. That is, in each case it will compare each data point with allthe other data points in each window and collect all the points sat-isfying the neighbor condition of any user query. On the other handLEAP repeatedly detects outliers for each query from scratch. ItsCPU performance thus degrades quickly as the number of queriesincreases. We thus confirm that the CPU efficiency of both MCODand LEAP is significantly worse than that of SOP.

SOP is also superior in memory usage as shown in Fig. 7(b).This is because in each window given a data point p, MCOD keepsall data points satisfying the neighbor conditions of p with respectto any query. SOP instead determines that it is not necessary for theevaluation of the outlier status of p. On the other hand, LEAP, with-out leveraging the sharing opportunities across multiple queries,maintains the neighbors of each point independently for each query.

Arbitrary K Case. In this experiment, we analyze the perfor-mance of SOP by varying k parameter values of the queries. Weuse a fixed window size of 10K and a slide size of 0.5K. Parameterr is fixed at 700. A value for k is randomly selected in the rangefrom 30 to 1500 for each query.

As shown in Fig. 8(a), similar to the case of varying r, the CPUperformance of SOP outperforms the other two alternatives up to 4orders of magnitude. Since the case of varying k can be treated asa special case of the case with arbitrary k and r values, this experi-ment demonstrates the effectiveness of our K-Sky algorithm whenhandling multiple skyband queries.

536

Page 11: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

10 100 500 1000100

101

102

103

104

105

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

10 100 500 10000

200

400

600

800

1000

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (log scale)

Figure 9: Varying k and r values on synthetic dataset

1 2 4 8101

102

103

104

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) Same attribute

1 2 3 4101

102

103

104

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(b) Different attributes

Figure 10: CPU (log scale): small workload

10 100 500 1000101

102

103

104

105

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

10 100 500 10000

100

200

300

400

500

600

700

800

900

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (linear scale)

Figure 11: Varying Win for queries on STT dataset

10 100 500 1000101

102

103

104

105

106

Number of Queries

CPU

Tim

e (m

s) o

n Lo

g Sc

ale

SOPMCODLEAP

(a) CPU (log scale)

10 100 500 10000

200

400

600

800

1000

1200

Number of Queries

Mem

ory

Foot

prin

t (M

B)

SOPMCODLEAP

(b) Memory (log scale)

Figure 12: Varying W and S values on STT dataset

The CPU resources utilized by SOP are very stable as the num-ber of queries increases. This is because for each workload the kvalue is randomly selected in the same range. In each workload atleast one of the randomly selected k is likely to get fairly close tothe upper ceiling value in the range. In other words the maximum kin each workload is similar on average. Therefore this experimentdemonstrates that the performance of SOP relies on the largest kvalue instead of on the number of queries in the workload. There-fore SOP scales to a potentially huge workload composed of a largenumber of queries. A similar trend can also be observed in memoryutilization as shown in Fig. 8(b).

Arbitrary K and R Case. In this experiment, we assess theperformance of the algorithms when varying both k and r. We fixthe window size to 10K, slide size to 0.5K, while the values forboth k and r are randomly generated in the range respectively from30 to 1500 and from 200 to 2000 for each query.

Fig. 9 depicts the performance of the three algorithms in termsof CPU costs and memory consumption. We observe that SOPconsistently outperforms MCOD and LEAP up to 3 orders of mag-nitude. This confirms that K-SKY not only effectively shares thecomputation among the queries with an identical k parameter, butit also achieves full sharing across multiple skyband queries withrespect to different query groups with distinct k values. MCODinstead solves this case by simulating an outlier query using thelargest k and smallest r values in the workload as its pattern pa-rameters. Such a query can have much more restricted neighborrequirements and in turn more expensive range queries than any ofthe actual outlier queries. Huge CPU and memory resources maybe wasted compared to SOP.

Small Workload. In this set of experiments, we test the perfor-mance of SOP when processing small workload. It is composedof two experiments. In the first experiment, all queries utilize thesame set of attributes in the detection of outliers. We vary the sizeof the workload by containing 1, 2, 4, 8 queries. Again it confirmsthat SOP performs well even in this small workload case as shownin Fig. 10(a). In particular when the workload contains only onesingle query, SOP does not perform worse than the state-of-the-art

single query approach LEAP. This shows that no much extra over-head is introduced by SOP.

Furthermore, we also evaluate the performance of SOP whenhandling queries utilizing different set of attributes. In this exper-iment, the queries are divided into 3 groups. The queries in thesame group utilize the same set of attributes. We vary the num-ber of queries in each group from 1 to 4. To support such work-load SOP is slightly extended using a simple divide and conquerapproach. As depicted in Fig. 10(b) our extended SOP approachcontinues to perform well. More specifically SOP is at least 150times faster than MCOD and two times faster than LEAP even inthis small number of queries case.

6.3 Varying Window Specific ParametersNext, we focus on workloads composed of 10, 100, 500, 1000

queries respectively for the case when varying window-specific pa-rameters, while using a fixed pattern-specific parameter setting. Inthis set of experiments the stock data [11] is utilized to evaluatehow our SOP solution performs when handling real datasets.

Arbitrary Win Case. In this experiment, we study the perfor-mance of SOP for window sizes ranging from 1K to 500K. We fixthe slide size as 0.5K, r as 200, and k as 30.

As shown in Fig. 11, SOP features significantly better perfor-mance on both CPU and memory consumption compared to MCODand LEAP. Since MCOD leverages the sharing opportunities acrossthe windows of multiple queries by adopting our swift query strat-egy, its CPU and memory usage is relatively stable compared toLEAP as the number of queries increases. However MCOD is stilloutperformed by SOP by at least 2 orders of magnitude in CPUtime as shown in Fig. 11(a). This is because based on our safe-for-all observation in Sec. 4.1, SOP terminates and excludes p fromany future evaluation process immediately once p is classified assafe inlier by the skyband query corresponding to the outlier querywith the largest window size. As stated earlier, MCOD instead re-lies on a range query to detect these outliers. Even if a data point isrecognized as safe inlier, this neighbor search continues completingthe comparisons with all other untouched data points. Therefore ahuge amount of CPU resources can be wasted.

537

Page 12: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

Arbitrary Win and Slide Case. In this experiment, we investi-gate the performance of SOP when varying both window-specificparameters. We fix k as 30 and r as 200. The window and slidesizes are arbitrarily selected for each query from the range of 1K to500K and from 50 to 50K respectively.

As illustrated in Fig. 12, the average CPU time consumed bySOP increases only from 28ms to 282ms (10 folds) as the numberof queries increases from 10 to 1000 (100 folds). This continuesto outperform the alternative algorithms by at least two orders ofmagnitudes. Clearly, results shown in Fig. 12 demonstrate the ef-fectiveness of the swift query strategy for handling arbitrary winand arbitrary slide case.

6.4 Varying Pattern and Window ParametersIn this most general case, we prepare four workloads composed

of 100, 1000, 10,000, 50,000 queries respectively by varying allwindow-specific and pattern-specific parameters.

We observe from Fig. 13(a) that similar to the cases of inde-pendently varying pattern-specific parameters and window-specificparameters SOP achieves tremendous gain in CPU utilization com-pared to (augmented) MCOD and (the non-shared) LEAP. Fur-thermore SOP shows excellent scalability in the cardinality of theworkload. As the number of queries rises from 1000 to the ex-tremely large cardinality of 50,000 queries, the CPU costs of SOPonly increase from 32ms to 892ms. As the number of the queriesincreases, the sharing opportunities among the given set of queriesalso increase. Since SOP achieves full sharing across queries, iteffectively reduces the CPU burden caused by the huge workload.

The memory usage of SOP also consistently outperforms the al-ternatives solutions as shown in Fig. 13(b). As previously stated,the reason is that LEAP detects outliers on the same streaming datafor each query independently. Hence the memory consumed by theworkload queries accumulates as the number of queries grows. Onthe other hand, MCOD always detects outliers by discovering andmaintaining all neighbors for each point. However, SOP only re-quires minimal information to prove the outlier status of each point.As a result, SOP effectively avoids the usage of unnecessary spaceby purging redundant intermediate results. Therefore significantmemory utilization is reduced by SOP.

7. RELATED WORKDistance-based Outliers on Streaming Data. With the emer-

gence of digital devices generating data streams, outliers on stream-ing data are one type of anomalies recently studied [7, 13, 1]. Ex-isting work [7, 13, 1] focuses primarily on processing a single out-lier detection request. In particular [1] leverages the observationthat the neighbors p

i

of a point p that arrived after p do not expirebefore p expires. They make a distinction between the precedingneighbors of p, P

p

, i.e., those that will expire before p, and the suc-ceeding neighbors of p, S

p

, those that will persist during the entirelifetime of p. They first introduce the idea of a “safe inlier” as apoint p with � k succeeding neighbors.

[13] further improves on [1] by leveraging the safe inlier conceptof [1]. That is, it organizes the data points into a queue based onthe number of their succeeding neighbors, so that it can efficientlyschedule the necessary checks that have to be made when the win-dow slides. However it still relies on full range query searchesto process the newly arriving points. Therefore it cannot providereal time responsiveness when applied to high velocity streamingdata. Besides this single query technique, [13] also touches on sup-porting multiple outlier detection queries for the case of varyingpattern-specific parameters. Given a data point p they first utilizea range query to find all points that satisfy the neighbor condition

of all queries in the query group. Then a postprocessing step is ap-plied to filter the unnecessary points from this large neighbor set toreduce the maintenance costs. In our work by directly transform-ing the multi-query outlier problem into the single query skybandproblem, we only collect the necessary evidence that is sufficient toanswer multiple outlier queries, therefore significantly outperform-ing this method as confirmed in our experiments.

Due to a deeper understanding the temporal relationships amongstream data points, [7] overcomes the limitation of prior methods[1, 13] of undertaking full range query searches by discovering asearly as possible safe inliers in the scan process. This allows it tobetter satisfy the performance requirements of modern streamingapplications. However multiple outlier detection requests are notsupported in [7].

Outliers on Sensor Streams. [20] proposes an online techniqueto detect outliers in sensor stream data. This work primarily focuseson how to reduce the message transmission and in turn reduce thepower consumption of the sensors. First, it utilizes kernel den-sity estimation to model the distribution of the sensor data. Thengiven a point p, [20] approximates whether p is an outlier by es-timating the number of its neighbors using the density distributionfunction f (p). However unlike our work focusing on detecting ex-act distance-based outliers, this approximation no longer assurescomplete results. Furthermore, they only support a single outlierdetection query. No multi-query technique is discussed.

kNN Queries on Streams. Continuous kNN queries over slid-ing windows have indeed been studied in [15, 6]. Both works use agrid to index the stream data. To improve response time, they eitherpostpone the processing of the new points which are not likely tobe in kNN set [6] or eagerly pre-compute the possible kNN set foreach future window as new data arrives [15].

However to determine the outlier status of p it is not necessaryto always discover the full kNN of p. Rather one algorithm shouldstop evaluating p as long as any k neighbors of p have been discov-ered. Therefore the use of the streaming kNN algorithm to con-tinuously detect outliers is not an efficient approach. Instead ourcustomized skyband algorithm K-SKY always discovers the mini-mal information necessary to determine the outlier status of p.

General Multi-Query Optimization. Multiple query sharinghas been widely studied as a general optimization problem in stream-ing environments. Previous research on sharing computations stud-ied traditional SQL queries such as selection, join, and aggregation[4, 9, 22, 14]. Their methods include rewriting queries to exposecommon subexpressions, sharing indices, or segmenting input intopartitions and sharing partial results over the partitions. Howeverthe key problem we address in this work, namely correctly answer-ing multiple outlier detection queries by only collecting minimalinformation, is different from the more general purpose optimiza-tion effort required by the traditional SQL query sharing.

8. CONCLUSIONIn this work, we present the first solution, called SOP, for effi-

cient shared processing of a large number of distance-based outlierdetection requests with diverse parameter instantiations over slid-ing window streams. SOP requires only one single pass over thedata points to support a huge workload composed of a large num-ber of outlier detection requests with arbitrary input settings for allpattern and window specific parameters. Our experimental studyusing both real and synthetic streaming datasets confirms that forthe rich diversity of tested scenarios SOP outperforms the alter-natives on average three orders of magnitude in CPU time, whileusing only 5 % memory space of its counterparts.

538

Page 13: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

9. REFERENCES[1] F. Angiulli and F. Fassetti. Distance-based outlier queries in

data streams: the novel task and algorithms. Data Min.Knowl. Discov., 20(2):290–324, 2010.

[2] F. Angiulli and C. Pizzuti. Fast outlier detection in highdimensional spaces. In PKDD, pages 15–26, 2002.

[3] A. Arasu, S. Babu, and J. Widom. The cql continuous querylanguage. VLDB J., 15(2):121–142, 2006.

[4] A. Arasu and J. Widom. Resource sharing in continuoussliding-window aggregates. In VLDB, pages 336–347, 2004.

[5] S. D. Bay and M. Schwabacher. Mining distance-basedoutliers in near linear time with randomization and a simplepruning rule. In KDD, pages 29–38, 2003.

[6] C. Böhm, B. C. Ooi, C. Plant, and Y. Yan. Efficientlyprocessing continuous k-nn queries on data streams. InICDE, pages 156–165, 2007.

[7] L. Cao, D. Yang, Q. Wang, Y. Yu, J. Wang, and E. A.Rundensteiner. Scalable distance-based outlier detection overhigh-volume data streams. In ICDE, pages 76–87, 2014.

[8] C. Gupta, S. Wang, I. Ari, M. C. Hao, U. Dayal, A. Mehta,M. Marwah, and R. K. Sharma. Chaos: A data streamanalysis architecture for enterprise applications. In CEC,pages 33–40, 2009.

[9] M. A. Hammad, M. J. Franklin, W. G. Aref, and A. K.Elmagarmid. Scheduling for shared window joins over datastreams. In VLDB, pages 297–308, 2003.

[10] D. M. Hawkins. Identification of Outliers. Springer, 1980.[11] I. INETATS. Stock trade traces. http://www.inetats.com/.[12] E. M. Knorr and R. T. Ng. Algorithms for mining

distance-based outliers in large datasets. In VLDB, pages392–403, 1998.

[13] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas,and Y. Manolopoulos. Continuous monitoring ofdistance-based outliers over data streams. In ICDE, pages135–146, 2011.

[14] S. Krishnamurthy, C. Wu, and M. J. Franklin. On-the-flysharing for streamed aggregation. In SIGMOD Conference,pages 623–634, 2006.

[15] K. Mouratidis and D. Papadias. Continuous nearest neighborqueries over sliding windows. IEEE Trans. Knowl. DataEng., 19(6):789–803, 2007.

[16] A. Nazaruk and M. Rauchman. Big data in capital markets.In SIGMOD Conference, pages 917–918, 2013.

[17] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressiveskyline computation in database systems. ACM Trans.Database Syst., 30(1):41–82, 2005.

[18] S. Ramaswamy, R. Rastogi, and K. Shim. Efficientalgorithms for mining outliers from large data sets. InSIGMOD Conference, pages 427–438, 2000.

[19] G. E. Rosario, E. A. Rundensteiner, D. C. Brown, and M. O.Ward. Mapping nominal values to numbers for effectivevisualization. In InfoVis, pages 113–120, 2003.

[20] S. Subramaniam, T. Palpanas, D. Papadopoulos,V. Kalogeraki, and D. Gunopulos. Online outlier detection insensor data using non-parametric models. In VLDB, pages187–198, 2006.

[21] Y. Tao and D. Papadias. Maintaining sliding windowskylines on data streams. IEEE Trans. Knowl. Data Eng.,18(2):377–391, 2006.

[22] S. Wang, E. A. Rundensteiner, S. Ganguly, and S. Bhatnagar.State-slice: New paradigm of multi-query optimization of

window-based stream queries. In VLDB, pages 619–630,2006.

APPENDIXA. PROOF OF PROBLEM MAPPING

Lemma 1. Given a query group Q, for any data point p, theoutput of Qs , denoted as S

p

, is sufficient and necessary to contin-uously determine the outlier status of p with respect to all queriesin Q, where Qs represents the (k � 1 )-skyband query with respectto p with k as the neighbor threshold parameter defined in Q.Proof. “)” we prove the sufficiency by showing that Qs alwaysreturns the k nearest neighbors of p kNN(p) 3 as part of the sky-band points. Assume the data points in D have been partitionedinto t subsets S1, S2, ..., S

t

. Each subset Si

contains all datapoints with the same distance to p. Assume here that the subsetsS1, S2 ,..., S

t

are sorted in an ascending order by their distance to p.Assume | S

1

| + | S2

| +, ...,+ | Sx

|< k and | S1

| + | S2

| +...+ | S

x

| + | Sx+1

|� k . All data points in S1

[ S2

[ ... [ Sx

along with (k - (| S1

| + | S2

| +, ...,+ | Sx

|)) data points withlargest timestamp in S

x+1

will be returned by Qs , since by thedomination rule in Def. 5 they are dominated by at most k -1 datapoints. These k data points satisfy the kNN (p) definition, since noother data point is closer towards p than they are. Then by the k-distance observation, kNN(p) is sufficient to prove the outlier statusof p with respect to Q. The sufficiency is proven.

“(” By contradiction. Given a point pi

(pi

2 Sp

- kNN(p)), al-though by the k-distance observation, S

p

- {pi

} is still sufficient todetermine the status of p in the current window W

c

, potentially pmay be erroneously evaluated in the future window W

c+x

. Assumethat when the stream evolves to w

c+x

, all points in Sr

arriving ear-lier than p

i

have expired. If all points pj

arriving after Wc

are farfrom p (dist(p, p

j

) > ri

), pi

should now be in kNN(p). Since theoutlier status of p is determined by kNN(p), to correctly determinethe outlier status of p in any future window, p

i

has to be kept. Thenecessity is proven. ⌅

B. OPTIMALITY PROOF OF K-SKYLemma 2. K-SKY correctly discovers the (k-1)-skyband points

in window Wc

by examining only the minimum number of datapoints.Proof. We prove Lemma 2 by showing that: (1) Any point insertedto skybandPoints during the execution of K-SKY is guaranteed tobe a true (k-1)-skyband point; (2)No data point that could not be a(k-1)-skyband point is examined during the execution of K-SKY.

Proof of (1). In K-SKY the later arrivals are always evaluatedearlier than the earlier arrivals. Therefore any point p

i

alreadyadded into skybandPoints has a larger timestamp than any pointpj

remaining to be evaluated. That is, pj

.time < pi

.time . By thedomination rule defined in Def. 5, p

j

cannot dominate pi

. There-fore p

i

would not be replaced by any point evaluated later. Condi-tion (1) holds.

Proof of (2). Proof in two steps. First, K-SKY stops immedi-ately once the termination condition is satisfied. Namely K-SKYterminates immediately once one point p

i

is dominated by k pointsif the distance between p

i

and p is not larger than the smallest rparameter r

min

in Q. Therefore the remaining points that will notbe in skybandPoints are not evaluated.

Second, we prove any data point evaluated during the executionof K-SKY is potentially a (k-1)-skyband point. Data point p

i

isevaluated by K-SKY if and only if the termination condition has3This kNN is based on the normalized distance function.

539

Page 14: Sharing-Aware Outlier Analytics over High-Volume Data Streamspeople.csail.mit.edu/lcao/papers/SIGMOD16.pdf · 2019-10-22 · Sharing-Aware Outlier Analytics over High-Volume Data

not yet been satisfied. In other words when pi

was evaluated, atmost k - 1 data points p

j

with dist(p,pj

) rmin

existed at thattime. Therefore p

i

should be listed in skybandPoints if dist(p,pi

) r

min

. That is, if we were not to consider pi

, then potentially anincorrect skyband point set may be reported. Furthermore, in K-SKY if p 2W

c

is an point that survived the stream data expiration,only the new arrivals in W

c

and its unexpired unexpired skybandpoints in last window W

c�1

will be examined. By our least exam-ination optimization principle these points are the only points thatwill appear in the skyband point set of the new window W

c

. Thisconfirms that any point evaluated by K-SKY is indeed necessary toguarantee the correctness of the Qs query. ⌅

C. PROOF OF LEMMA 3Lemma 3. Given a query group Q: {q

1

, q2

, ..., qm

, qm+1

, ...,,

qmax

} (q1

.win < q2

.win, < ..., < qmax

.win), p is an outlier forqueries {q

1

, q2

, ..., qm

}, if qm+1

.Wc

.start < pold

.time < qm

.Wc

.start ,where p

old

is the oldest point in qs

max

.skyband .Proof. qs

max

is a special skyband query with respect to the singleoutlier query q

max

(k , r). Since p is an inlier with respect to qmax

,the skyband set qs

max

.skyband contains k neighbors of p. pold

isdominated by k -1 points in qs

max

.skyband , because pold

.time <pi

.time (8pi

2 qsmax

.skyband). Sinceqm+1

.Wc

.start < pold

.time < qm

.Wc

.start , pold

falls in the cur-rent window W

c

of {qm+1

, ..., qmax

}, but is out of the Wc

of{q

1

, q2

, ..., qm

}. Therefore p has at most k - 1 neighbors in qs

max

.skyband for {q

1

, q2

, ..., qm

}. Furthermore, any point pj

out ofqs

max

.skyband cannot be a neighbor of p for {q1

, q2

, ..., qm

} intheir current window W

c

. Otherwise pold

would also be dominatedby p

j

. Hence it would in total be dominated by k points. Thiscontradicts the fact that p

old

is a skyband point of qs

max

. ⌅

540