
KDE-Track: An Efficient Dynamic Density Estimator for Data Streams

Abdulhakim Qahtan, Suojin Wang, and Xiangliang Zhang

Abstract—Recent developments in sensors, global positioning system devices, and smart phones have increased the availability of spatiotemporal data streams. Developing models for mining such streams is challenged by the huge amount of data that cannot be stored in the memory, the high arrival speed, and the dynamic changes in the data distribution. Density estimation is an important technique in stream mining for a wide variety of applications. The construction of kernel density estimators is well studied and documented. However, existing techniques are either expensive or inaccurate and unable to capture the changes in the data distribution. In this paper, we present a method called KDE-Track to estimate the density of spatiotemporal data streams. KDE-Track can efficiently estimate the density function with linear time complexity using interpolation on a kernel model, which is incrementally updated upon the arrival of new samples from the stream. We also propose an accurate and efficient method for selecting the bandwidth value for the kernel density estimator, which increases its accuracy significantly. Both theoretical analysis and experimental validation show that KDE-Track outperforms a set of baseline methods on the estimation accuracy and computing time of complex density structures in data streams.

Index Terms—Adaptive resampling, bandwidth selection, data streams, dynamic density estimation, interpolation


1 INTRODUCTION

RECENT advances in computing technology allow for collecting vast amounts of data that arrive continuously in data streams. Examples of data streams can be found in fields such as sensor networks, mobile data collection platforms, and network traffic. The data need to be processed and analyzed as soon as they arrive. However, the unbounded, rapid, and continuous arrival of data streams precludes the use of traditional data mining techniques. Therefore, developing algorithms that process data streams instantaneously becomes highly important.

Density estimation has been widely used in various applications. Estimating the Probability Density Function (PDF) for a given data set provides knowledge about the underlying distribution of the data. Consequently, dense regions can be recognized as clusters, and quantities such as medians and centers of clusters can be computed [1]. By contrast, sparse regions are reported as outliers that can be used for fault detection, e.g., in sensor networks [2].

This paper aims to estimate the dynamic density that comes with evolving spatiotemporal data streams, e.g., traffic streams in a city. In 2013, more than 170 million taxi trips were recorded in New York City. Monitoring and visualizing the density of spatiotemporal streams will help in placing taxicabs [3], reducing ambulance emergency response times [4], and reflecting people's interest in a particular location during specific seasons [5].

However, estimating the dynamic density that comes with evolving spatiotemporal streams is a challenging task. Besides the problem of estimating the density from samples drawn from an unknown distribution, which already arises for stationary data, spatiotemporal data streams have further properties that complicate density estimation. First, the data distribution changes dynamically in an unpredictable fashion. Therefore, density estimation should rely more on the recently received data samples [6], [7], e.g., by using a sliding window. Second, an anytime-available model should be efficiently updated to allow real-time monitoring of the density. Meanwhile, the density function value of any newly arriving sample may need to be estimated instantly. Third, the spatial non-uniformity of the data distribution requires higher resolution in dense areas and lower resolution in sparse areas, so that the estimate is accurate enough to capture the details.

Most of the existing approaches for estimating the density of data streams are based on the Kernel Density Estimation (KDE) method due to its advantages for estimating the true density [8]. Given a set of samples $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_j \in \mathbb{R}^d$, KDE estimates the density at a point $\mathbf{x}$ as

$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{j=1}^{n} K_h(\mathbf{x}, \mathbf{x}_j), \tag{1}$$

where $K_h(\mathbf{x}, \mathbf{x}_j)$ is a kernel function, which is usually a radially symmetric unimodal function that integrates to 1. Eq. (1) shows that KDE uses all the data samples to estimate the PDF at any given point. In the problem of online density estimation of a data stream, i.e., estimating the density at every arriving data sample, KDE has quadratic time complexity with respect to (w.r.t.) the stream size. Also, the space requirement of KDE increases significantly with the dataset size.

• A. Qahtan and X. Zhang are with the Division of Computer, Electrical, and Mathematical Sciences & Engineering, King Abdullah University of Science & Technology (KAUST), Thuwal 23955-6900, Saudi Arabia. E-mail: {abdulhakim.qahtan, xiangliang.zhang}@kaust.edu.sa.

• S. Wang is with the Department of Statistics, Texas A&M University (TAMU), College Station, TX 77843. E-mail: [email protected].

Manuscript received 3 Dec. 2015; revised 24 Sept. 2016; accepted 3 Nov. 2016. Date of publication 8 Nov. 2016; date of current version 2 Feb. 2017. Recommended for acceptance by R. Gemulla. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2016.2626441

IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 3, March 2017, p. 642. 1041-4347 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

To reduce the high computational and space costs of KDE, the number of samples used is kept small in one of two ways: 1) merging kernels¹ [1], [6], [7], [9], where each merged kernel summarizes a set of similar samples; or 2) sampling [10], [11], where only selected kernels are used in the calculation of $\hat{f}(\mathbf{x})$. Table 1 summarizes the key characteristics of the most popular density estimators. Merging or sampling reduces the computational cost of KDE, but sacrifices estimation accuracy because fewer kernels are used. The limitations of these approaches are discussed further in Section 2.

In this paper, we propose a method called KDE-Track to model the data distribution as a set of resampling points with their estimated PDF values. To guarantee the estimation accuracy and to lighten the load on the model, an adaptive resampling strategy is employed to control the number of resampling points: more points are resampled in areas where the PDF has a larger curvature, while fewer points are resampled in areas where the function is approximately linear. To overcome the quadratic time complexity of KDE when evaluating the PDF for each new observation, linear interpolation is used with KDE for online density estimation. This technique was used in [12] with univariate stationary data and is known as Kernel Polygons. The new interpolation KDE computes the PDF value of a newly arriving data sample by interpolating selected resampling points. It can therefore evaluate the PDF for any new observation with time and space complexity that is linear in the number of resampling points. Evaluating the PDF for all received observations then takes linear time, compared with the quadratic time complexity of KDE. To track the evolving density in a timely manner, KDE-Track uses a sliding window strategy to estimate the density from the most recent data samples.

KDE-Track has the following unique properties:

(1) It generates density functions that are available for visualizing the dynamic density of data streams at any time. At any time $t$, after receiving one streaming data sample $\mathbf{x}_t$, KDE-Track updates the PDF of the data stream and also estimates $\hat{f}(\mathbf{x}_t)$. As a real-world application example, KDE-Track is employed to visualize the density of pickup events in New York taxi data in an online fashion, which helps in finding interesting patterns.

(2) It has linear time and space complexities w.r.t. the model size for maintaining the dynamic PDF of the data stream upon the arrival of every new sample. It is thus 8 to 85 times faster than the traditional KDE, depending on the window size.

(3) Its estimation accuracy is guaranteed by adaptive resampling and an optimized bandwidth ($h$), which also address the spatial non-uniformity of spatiotemporal data streams. Compared with a set of baseline methods, it achieves the lowest estimation error, especially when the density function is multimodal and complex.

Both theoretical analysis and experimental results on synthetic and real-world data show the effectiveness of our approach for estimating the dynamic density functions that come with spatiotemporal data streams.

The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 discusses KDE and its related issues. Section 4 introduces our approach, KDE-Track. Section 5 discusses the implementation details of KDE-Track. Section 6 presents the evaluation results of our approach, and Section 7 concludes our work.

2 RELATED WORK

This section discusses the work that is related to our study and also focuses on estimating the dynamic density of data streams. We should consider not only the constraints of using limited memory and processing the data in real time ([16], [17]) due to the nature of streams, but also the dynamic changes of the underlying density function over time.

To reduce the computational cost and space requirement of KDE, methods have been proposed based on kernel merging, sampling, or space partitioning. Kernel merging is used in [1], [6], and [7], where a specific number of kernels is maintained by merging two or more kernels. Each kernel summarizes a cluster of similar samples. A newly arriving sample can either fall into an existing kernel or trigger a new kernel. Two kernels are merged if the number of kernels exceeds the specified number. Since merging kernels is a lossy approximation, cost functions were proposed to decide which kernels to merge by measuring the amount of information discarded during the merging process. The three methods differ in how they select the bandwidth value. Cluster Kernels (CK) [7] uses a global bandwidth value for all kernels, M-Kernel [1] uses a different bandwidth for each kernel point, and AKDE [6] defines a set of local regions with minimum intra-density variations and uses a different bandwidth value for each region. When estimating the PDF for one sample, all three methods need to resample from the kernels. Single-value resampling can be inaccurate, since only one sample is used to represent all samples in that kernel. Regenerating all samples in a kernel might be accurate but is computationally very costly. Moreover, in [6], the dynamic changes in the data distribution of evolving data streams require rebuilding the model frequently to define the set of local regions, which is very costly. Another method merges kernels based on clustering by Self-Organizing Maps (SOM) [9]. Only trained SOM neurons are used in density estimation, rather than the whole set of kernels. To train the neurons and to minimize the time complexity, a data stream is treated as a sequence of disjoint windows, where the data in each window are assumed to have the same distribution. Depending on the window size, this assumption may be violated for data streams.

TABLE 1. Summary of the Key Characteristics of the Density Estimators (MV = Multivariate Data, RS = Random Sampling, GS = Group Selection, SS = Sort Selection, and BU = Batch Update)

| Method | MV | Data streams | Bandwidth selection | Bandwidth on each dimension in MV | Data points reduction technique | MV kernel function | Online update |
|---|---|---|---|---|---|---|---|
| CK [7] | No | Yes | Normal rule | N/A | Merge kernels | N/A | Yes |
| M-Kernels [1] | No | Yes | Normal rule | N/A | Merge kernels | N/A | Yes |
| SOMKE [9] | Yes | Yes | Normal rule | Different | Trained neurons | MV kernel function | BU |
| FFT-KDE [13] | Yes | Yes | Normal rule | Different | None | MV kernel function | No |
| KDE [14] | Yes | Yes | Normal rule | Different | None | Product kernel function | No |
| kd-tree [15] | Yes | No | Cross validation | Fixed | None | Rotation-invariant function | No |
| RS, GS and SS [10] | Yes | No | User input | Fixed | Sampling | Rotation-invariant function | No |
| MPLKernels [11] | Yes | Yes | Normal rule | Different | Sampling | Product kernel function | Yes |
| KDE-Track (proposed) | Yes | Yes | Plug-in | Different | None | Product kernel function | Yes |

1. To be consistent with all KDE-related work [1], [6], [7], [9], [10], [11], we also use the term kernel or kernel point to represent the data samples, and kernel function to represent the function $K(\cdot)$.

Sampling was used in [10] to reduce the number of kernels in large datasets while guaranteeing an $\epsilon$-approximation of the density function. The authors studied random sampling and proposed group-selection and sort-selection, which achieve the same accuracy as random sampling but with a smaller number of samples. Sampling can significantly reduce the computation cost, but at the expense of estimation accuracy.

Space partitioning is also used to reduce the computational cost of KDE. A kd-tree structure is used in [11] and [15], where the leaves contain a small number of kernels and each internal node contains a statistical summary of the subspace represented by that node. Estimating the density at any given point involves a depth-first traversal of the tree in which only close-by nodes are visited. These methods have the same problem of representing many kernels by a single value. Controlling the size of the tree is another issue. Grid-based methods were presented in [12], [18], [19], [20], [21] for static datasets and concentrate on the best setting of the bin width using a fixed number of resampling points. This approach does not work for data streams, as the data are not available in advance and the range of the data changes over time, which requires an adjustable model.

Our study differs from the above-mentioned methods by updating the estimated density with the contribution of each newly arriving sample. Hence, it provides anytime-available density values, which can be used for visualizing the density. The model size is controlled by adaptive resampling, rather than by reducing the number of kernels used as in [1], [6], [7], [9], [10], [11]. The estimation error is thus minimized. In addition, it is deployed with an accurate bandwidth selection method, which improves the density estimation significantly.

3 THEORETICAL BASES

In this section, we discuss the traditional KDE and its related issues, e.g., the selection of kernel functions and the smoothing parameter (bandwidth), and complexity.

KDE estimates the density $\hat{f}(\mathbf{x})$ by Eq. (1). For the case of univariate data, Eq. (1) is written as

$$\hat{f}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x - x_j}{h}\right). \tag{2}$$

For 2-d spatial samples, where $\mathbf{x}_j = (x_{1j}, x_{2j})^T \in \mathbb{R}^2$, the kernel functions $K_h(\mathbf{x}, \mathbf{x}_j)$ are defined as $\frac{1}{h_1 h_2} K\left(\frac{x_1 - x_{1j}}{h_1}, \frac{x_2 - x_{2j}}{h_2}\right)$, where $h_i$ is the smoothing parameter, called the bandwidth, on dimension $i$ [8].

A popular kernel function for multivariate data is the multiplicative (product) kernel [8], which uses the product of univariate kernel functions on each dimension and computes $\hat{f}(\mathbf{x})$ as

$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{j=1}^{n} \left\{ \prod_{i=1}^{2} \frac{1}{h_i} K\left(\frac{x_i - x_{ij}}{h_i}\right) \right\}. \tag{3}$$

Another option is to use the orientation-invariant kernel function [10], [15], which is

$$\hat{f}(\mathbf{x}) = \frac{1}{n h^2} \sum_{j=1}^{n} K\left(\frac{\|\mathbf{x} - \mathbf{x}_j\|}{h}\right). \tag{4}$$

This kernel function assumes that the data variation is the same along all dimensions, which may fail to capture densities of arbitrary shapes.

3.1 KDE Related Issues

The choice of a kernel function is relatively unimportant provided that the kernel function is continuous with finite support [14]. It is recommended that the selected kernel be smooth, clearly unimodal, and symmetric about the origin [8]. In our density estimator, we choose the multiplicative Epanechnikov kernel, where the same univariate kernel function $K(x) = \frac{3}{4}(1 - x^2) I_{[-1,1]}(x)$ is used in each dimension with a different bandwidth value. We use the Epanechnikov kernel because of its asymptotically optimal efficiency among all kernel functions [22].
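As an illustration of Eq. (3) with the Epanechnikov kernel, the following is a minimal NumPy sketch of the plain (non-streaming) product-kernel KDE. The function names `epanechnikov` and `product_kde` are hypothetical and chosen for this example only; the code evaluates every sample for every query point, which is exactly the cost that KDE-Track is designed to avoid.

```python
import numpy as np

def epanechnikov(u):
    """Univariate Epanechnikov kernel K(u) = 3/4 (1 - u^2) on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def product_kde(points, samples, h):
    """Evaluate Eq. (3) at each row of `points`.

    points:  (m, d) array of evaluation locations
    samples: (n, d) array of data samples
    h:       (d,) array of per-dimension bandwidths
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    h = np.asarray(h, dtype=float)
    # u[p, j, i] = (x_i - x_ij) / h_i for evaluation point p and sample j
    u = (points[:, None, :] - samples[None, :, :]) / h
    per_dim = epanechnikov(u) / h           # shape (m, n, d)
    # Product over dimensions, then average over the n samples
    return per_dim.prod(axis=2).mean(axis=1)
```

With a single sample at the origin and unit bandwidths, the estimate at the origin is $(3/4)^2$, and it is zero anywhere outside the kernel support.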

The estimation accuracy of KDE is mainly affected by the bandwidth value [8], [14]. A large bandwidth value over-smooths the density function curve and hides a lot of useful information, while a small bandwidth value makes the density function's curve fluctuate too much. A general rule for selecting the bandwidth is to decrease the bandwidth value ($h \to 0$) as the number of samples used in the estimation increases ($n \to \infty$). However, $h$ must approach 0 slowly enough that $nh \to \infty$.

3.2 The Bandwidth h

Eqs. (1) and (3) use different bandwidth values to capture the spread of the data on each dimension. This suggests applying the univariate bandwidth analysis to the marginal distribution of the data on each dimension. A typical rule of bandwidth setting is to minimize the deviation between the true and the estimated densities. This deviation is commonly measured by the Mean Integrated Square Error (MISE) [23]. Let $\mu_k(K)$ and $R(f)$ be defined as $\mu_k(K) = \int x^k K(x)\,dx$ and $R(f) = \int f^2(x)\,dx$. The MISE of the estimator using a bandwidth value $h$ is $\mathrm{MISE}(h) = \int E[\hat{f}(x) - f(x)]^2\,dx$, which has the asymptotic expansion $\mathrm{MISE}(h) = \mathrm{AMISE}(h) + O(n^{-1} + h^5)$ under suitable regularity conditions on $K$ and $f$. The minimizer of $\mathrm{AMISE}(h) = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2^2(K) R(f'')$ is considered a good approximation of the optimal bandwidth value, which can be estimated as

$$h = \left[ \frac{R(K)}{\mu_2^2(K)\, R(f'')\, n} \right]^{1/5}. \tag{5}$$

However, this minimizer cannot be computed as it dependson the unknown density f .

Many methods have been introduced to estimate $R(f'')$ in Eq. (5). The normal rule [14] is the most popular method for estimating the bandwidth. It assumes that the unknown density $f$ is a normal density and scales it according to the standard deviation of the data samples. The bandwidth value selected using the normal rule is computed as

$$h = c\,\sigma\, n^{-1/5}, \tag{6}$$

where $c$ is a constant that depends on the kernel function $K$, $\sigma$ is the sample standard deviation, and $n$ is the number of kernels. This method is time-efficient, but it does not work well when the density deviates significantly from normality. Other methods based on cross-validation have been proposed in the literature [24], [25], [26], [27], [28], [29]. These methods require performing density estimation for each candidate bandwidth value, which multiplies the computational cost by a factor equal to the number of candidates.
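The constant $c$ in Eq. (6) follows from Eq. (5) once $f$ is taken to be normal. The sketch below works this out for the Epanechnikov kernel, using the standard values $R(K) = 3/5$ and $\mu_2(K) = 1/5$ for that kernel and $R(f'') = 3/(8\sqrt{\pi}\,\sigma^5)$ for a normal density. The function names are hypothetical and not taken from the paper.

```python
import math

# Epanechnikov kernel constants: R(K) = integral of K^2 = 3/5,
# mu_2(K) = integral of x^2 K(x) = 1/5.
R_K = 3.0 / 5.0
MU2_K = 1.0 / 5.0

def normal_rule_constant():
    """Constant c in Eq. (6), obtained by substituting a normal f into Eq. (5).

    For N(mu, sigma^2), R(f'') = 3 / (8 sqrt(pi) sigma^5), so Eq. (5)
    reduces to h = c * sigma * n^(-1/5).
    """
    r_f2_unit = 3.0 / (8.0 * math.sqrt(math.pi))   # R(f'') for sigma = 1
    return (R_K / (MU2_K**2 * r_f2_unit)) ** 0.2

def normal_rule_bandwidth(sigma, n):
    """Eq. (6): h = c * sigma * n^(-1/5)."""
    return normal_rule_constant() * sigma * n ** (-0.2)
```

Under these assumptions the constant evaluates to $(40\sqrt{\pi})^{1/5} \approx 2.345$.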

Plug-in methods [30], [31], [32], [33] estimate an approximation of $R(f'')$ and plug it into Eq. (5) to compute the optimal bandwidth. Estimating $R(f'')$ also requires making assumptions about the density function, but it is more accurate than using the normal rule. Sheather and Jones [29] estimate $R(f'')$ by estimating $f^{(4)}$, which in turn is estimated using $R(f^{(6)})$. The value $R(f^{(6)})$ is computed by assuming that $f^{(8)}$ is the eighth derivative of a normal density. After estimating $R(f^{(6)})$, a backward substitution is performed to estimate $R(f'')$. Shimazaki and Shinomoto [34] assume that the true density follows a Poisson distribution and use the Expectation-Maximization (EM) method to find the optimal bandwidth value. The method requires estimating the density at each step of the EM optimization procedure, which is very expensive for streaming data, where the density changes dynamically and the bandwidth value needs to be re-estimated frequently.

In the case of multidimensional data, most of the bandwidth selection methods either consider a fixed bandwidth value for all the dimensions [5], [10], [15], [35], [36], [37] or use the marginal distribution to estimate the bandwidth on each dimension [11], [14]. KDE fails to capture densities of arbitrary shapes when using the same bandwidth value for all dimensions. Methods that select different bandwidth values for each dimension rely on the marginal distribution of the data on that dimension.

In this work, we minimize the effect of the normality assumption on $f$ by using the data samples to estimate $f''$. Numerical integration is then used to compute $R(f'')$, which is plugged into Eq. (5) to estimate the bandwidths.
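The plug-in step can be sketched as follows, assuming that values of the second derivative are already available on a one-dimensional grid. The trapezoidal rule is an assumption of this sketch (the text only says that numerical integration is used), and the names `roughness` and `plug_in_bandwidth` are hypothetical. The standard normal density serves as a check because its $R(f'')$ is known in closed form.

```python
import numpy as np

R_K = 3.0 / 5.0      # R(K) for the Epanechnikov kernel
MU2_K = 1.0 / 5.0    # mu_2(K) for the Epanechnikov kernel

def roughness(grid, second_deriv):
    """Trapezoidal estimate of R(f'') = integral of f''(x)^2 dx.

    Works on non-uniform grids, which matters once resampling is adaptive.
    """
    x = np.asarray(grid, dtype=float)
    y = np.asarray(second_deriv, dtype=float) ** 2
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def plug_in_bandwidth(grid, second_deriv, n):
    """Eq. (5) with R(f'') replaced by its numerical estimate."""
    return (R_K / (MU2_K**2 * roughness(grid, second_deriv) * n)) ** 0.2

# Check on a case with a known answer: the standard normal density,
# whose second derivative is (x^2 - 1) * phi(x).
x = np.linspace(-8.0, 8.0, 4001)
phi = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
f2 = (x**2 - 1.0) * phi
h = plug_in_bandwidth(x, f2, n=1000)
```

For this normal test case the result agrees with the normal rule, as expected; the two rules differ only when the estimated curvature departs from that of a normal density.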

4 KDE-TRACK METHOD

In this section, we describe our model (KDE-Track) for esti-mating the dynamic density functions that come with datastreams. We theoretically analyze how KDE-Track mini-mizes the estimation error.

4.1 KDE-Track Overview

We model the distribution of the streaming data as a grid of resampling points and their corresponding estimated density values. Let $\mathcal{U}_1 = \{u^1_0, u^1_1, \ldots, u^1_{U_1 - 1}\}$ and $\mathcal{U}_2 = \{u^2_0, u^2_1, \ldots, u^2_{U_2 - 1}\}$ be the sets of points that discretize the range of the data on the first and the second dimensions, respectively. The KDE-Track model $\mathcal{M}$ is defined as the set of grid points from $\mathcal{U}_1 \times \mathcal{U}_2$ with their estimated densities. That is, $\mathcal{M} = \{M_0, M_1, \ldots, M_{q-1}\}$, where $q = U_1 U_2$ is the number of resampling points and $M_s$ is an ordered pair representing a grid point and its estimated PDF, $M_s = (\mathbf{m}_s, \hat{f}(\mathbf{m}_s))$. Here $\mathbf{m}_s = (u^1_k, u^2_l) \in \mathcal{U}_1 \times \mathcal{U}_2$ is the $s$th resampling point, with $l$ and $k$ being the quotient and the remainder of the division of $s$ by $U_1$, and $\hat{f}(\mathbf{m}_s)$ is the density estimated using KDE at $\mathbf{m}_s$.

Density estimation using bilinear interpolation is based on constructing the grid of resampling points and estimating their corresponding density values. Estimating the PDF at a data sample $\mathbf{a}$ by bilinear interpolation of the resampling points has two steps: (1) fetch the estimated PDF values at the resampling points $\mathbf{m}_{s_1}$, $\mathbf{m}_{s_1+1}$, $\mathbf{m}_{s_2}$, and $\mathbf{m}_{s_2+1}$ that surround the point $\mathbf{a}$ (as in Fig. 1). Let $y^{(i)}$ be the projection of vector $\mathbf{y}$ on the $i$-axis; then $m^{(1)}_{s_1} = m^{(1)}_{s_2} \le a^{(1)} < m^{(1)}_{s_1+1} = m^{(1)}_{s_2+1}$ and $m^{(2)}_{s_1} = m^{(2)}_{s_1+1} \le a^{(2)} < m^{(2)}_{s_2} = m^{(2)}_{s_2+1}$; (2) estimate the density at $\mathbf{a}$ using the interpolation.

Fig. 1. Computing the density at $\mathbf{a}$ by interpolation given the KDE estimation $\hat{f}$ at $\mathbf{m}_{s_1}$, $\mathbf{m}_{s_1+1}$, $\mathbf{m}_{s_2}$, and $\mathbf{m}_{s_2+1}$.


4.2 Density Estimation by Interpolation

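Estimating the density at $\mathbf{a}$ using bilinear interpolation is done by linearly interpolating the density at $\mathbf{m}_{s_1}$ and $\mathbf{m}_{s_1+1}$ to estimate the density at $\mathbf{r}_{s_1}$, and interpolating the density at $\mathbf{m}_{s_2}$ and $\mathbf{m}_{s_2+1}$ to compute the density at $\mathbf{r}_{s_2}$. The final step interpolates the density at $\mathbf{r}_{s_1}$ and $\mathbf{r}_{s_2}$ to find the required density at $\mathbf{a}$. Let $D(\mathbf{b}, \mathbf{c})$ be the Euclidean distance between $\mathbf{b}$ and $\mathbf{c}$. The density at $\mathbf{a}$ is computed as follows:

$$\tilde{f}(\mathbf{a}) = \frac{D(\mathbf{a}, \mathbf{r}_{s_2})\, \tilde{f}(\mathbf{r}_{s_1}) + D(\mathbf{r}_{s_1}, \mathbf{a})\, \tilde{f}(\mathbf{r}_{s_2})}{D(\mathbf{r}_{s_1}, \mathbf{r}_{s_2})}, \tag{7}$$

where

$$\tilde{f}(\mathbf{r}_{s_1}) = \frac{D(\mathbf{r}_{s_1}, \mathbf{m}_{s_1+1})\, \hat{f}(\mathbf{m}_{s_1}) + D(\mathbf{m}_{s_1}, \mathbf{r}_{s_1})\, \hat{f}(\mathbf{m}_{s_1+1})}{D(\mathbf{m}_{s_1}, \mathbf{m}_{s_1+1})},$$

and

$$\tilde{f}(\mathbf{r}_{s_2}) = \frac{D(\mathbf{r}_{s_2}, \mathbf{m}_{s_2+1})\, \hat{f}(\mathbf{m}_{s_2}) + D(\mathbf{m}_{s_2}, \mathbf{r}_{s_2})\, \hat{f}(\mathbf{m}_{s_2+1})}{D(\mathbf{m}_{s_2}, \mathbf{m}_{s_2+1})}.$$

KDE interpolation is efficient, as it stores only the function values at the resampling points, whose total number is of constant order and is small compared to the stream size. The running time for estimating the PDF for all $n$ arriving data samples is in $O(n|\mathcal{M}|)$.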
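Eq. (7) can be sketched in a few lines, assuming the four corner points form an axis-aligned cell so that every distance $D(\cdot,\cdot)$ reduces to a coordinate difference. The function names and argument layout are hypothetical, chosen for this example only.

```python
def interp_1d(p, p_lo, p_hi, f_lo, f_hi):
    """Linear interpolation between (p_lo, f_lo) and (p_hi, f_hi) at p."""
    return ((p_hi - p) * f_lo + (p - p_lo) * f_hi) / (p_hi - p_lo)

def bilinear(a, m_s1, m_s2p1, f_s1, f_s1p1, f_s2, f_s2p1):
    """Eq. (7): density at a = (a1, a2) from the four corners of its cell.

    m_s1   : corner m_{s1}   = (x_lo, y_lo)
    m_s2p1 : corner m_{s2+1} = (x_hi, y_hi)
    f_*    : KDE values at m_{s1}, m_{s1+1}, m_{s2}, m_{s2+1}
    """
    (x_lo, y_lo), (x_hi, y_hi) = m_s1, m_s2p1
    # Interpolate along dimension 1 on the two edges of the cell
    f_r1 = interp_1d(a[0], x_lo, x_hi, f_s1, f_s1p1)   # density at r_{s1}
    f_r2 = interp_1d(a[0], x_lo, x_hi, f_s2, f_s2p1)   # density at r_{s2}
    # Interpolate along dimension 2 between r_{s1} and r_{s2}
    return interp_1d(a[1], y_lo, y_hi, f_r1, f_r2)
```

Because each step is linear, the scheme reproduces any plane exactly; interpolation error only appears where the density has curvature, which is what Section 4.3 quantifies.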

4.3 Error Analysis

This section discusses the error incurred by our density estimator and aims to minimize it. KDE-Track incurs three types of error: the estimation error inherited from KDE, the interpolation error, and the rounding error. Since the rounding error (which occurs when an infinite number of digits after the decimal point is squeezed into a finite number of bits) is machine dependent, we focus on the interpolation error and propose an adaptive resampling model to minimize this type of error. The error inherited from KDE is minimized by selecting the optimal bandwidth values for the KDE.

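In [38], we studied the interpolation error for the case of univariate data. Let $\hat{f}(a)$ and $\tilde{f}(a)$ be the estimated PDFs using the traditional KDE and KDE-Track, respectively, and let $\Delta_m$ be the maximum distance between two consecutive resampling points. The error is found to be

$$\tilde{f}(a) - \hat{f}(a) = \frac{\Delta_m^2}{8} \hat{f}''(a) + O_p(\Delta_m^3).$$

The interpolation error increases in the case of spatial data. The bounding box of a sample $\mathbf{a}$ contains four resampling points such that, in each dimension, the projection of $\mathbf{a}$ lies between two of these resampling points. An upper bound on the bilinear interpolation error can be found in the following way. Let $\hat{f}$ be the density estimated using the traditional KDE. Assume that $\hat{f}$ is twice continuously differentiable in an open neighborhood around $\mathbf{r}_{s_1}$, $\mathbf{r}_{s_2}$, and $\mathbf{a}$. We use the notation $\hat{f}_{x_i} = \frac{\partial \hat{f}}{\partial x_i}$ and $\hat{f}_{x_i x_i} = \frac{\partial^2 \hat{f}}{\partial x_i^2}$. As the expansions below imply, $\Delta_{11}$ and $\Delta_{12}$ are the distances along the first dimension from $\mathbf{r}_{s_1}$ (or $\mathbf{r}_{s_2}$) to its two neighboring resampling points, and $\Delta_{21}$ and $\Delta_{22}$ are the distances along the second dimension from $\mathbf{a}$ to $\mathbf{r}_{s_1}$ and $\mathbf{r}_{s_2}$. Writing $\hat{f}(\mathbf{m}_{s_1})$, $\hat{f}(\mathbf{m}_{s_1+1})$, $\hat{f}(\mathbf{m}_{s_2})$, and $\hat{f}(\mathbf{m}_{s_2+1})$ as Taylor expansions of $\hat{f}$ around $\mathbf{r}_{s_1}$ and $\mathbf{r}_{s_2}$, we get

$$\hat{f}(\mathbf{m}_{s_1}) = \hat{f}(\mathbf{r}_{s_1}) - \Delta_{11} \hat{f}_{x_1}(\mathbf{r}_{s_1}) + \frac{\Delta_{11}^2}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_1}) + O_p(\Delta_{11}^3),$$

$$\hat{f}(\mathbf{m}_{s_1+1}) = \hat{f}(\mathbf{r}_{s_1}) + \Delta_{12} \hat{f}_{x_1}(\mathbf{r}_{s_1}) + \frac{\Delta_{12}^2}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_1}) + O_p(\Delta_{12}^3),$$

$$\hat{f}(\mathbf{m}_{s_2}) = \hat{f}(\mathbf{r}_{s_2}) - \Delta_{11} \hat{f}_{x_1}(\mathbf{r}_{s_2}) + \frac{\Delta_{11}^2}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_2}) + O_p(\Delta_{11}^3),$$

$$\hat{f}(\mathbf{m}_{s_2+1}) = \hat{f}(\mathbf{r}_{s_2}) + \Delta_{12} \hat{f}_{x_1}(\mathbf{r}_{s_2}) + \frac{\Delta_{12}^2}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_2}) + O_p(\Delta_{12}^3).$$

We interpolate these function values to compute the PDF at $\mathbf{r}_{s_1}$ and $\mathbf{r}_{s_2}$ to get

$$\tilde{f}(\mathbf{r}_{s_1}) = \hat{f}(\mathbf{r}_{s_1}) + \frac{\Delta_{12} \Delta_{11}}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_1}) + O_p(\Delta_{11}^3 \Delta_{12}^3),$$

$$\tilde{f}(\mathbf{r}_{s_2}) = \hat{f}(\mathbf{r}_{s_2}) + \frac{\Delta_{12} \Delta_{11}}{2} \hat{f}_{x_1 x_1}(\mathbf{r}_{s_2}) + O_p(\Delta_{11}^3 \Delta_{12}^3).$$

Writing $\hat{f}(\mathbf{r}_{s_1})$ and $\hat{f}(\mathbf{r}_{s_2})$ as Taylor expansions of $\hat{f}$ around $\mathbf{a}$, we get

$$\hat{f}(\mathbf{r}_{s_1}) = \hat{f}(\mathbf{a}) - \Delta_{21} \hat{f}_{x_2}(\mathbf{a}) + \frac{\Delta_{21}^2}{2} \hat{f}_{x_2 x_2}(\mathbf{a}) + O_p(\Delta_{21}^3),$$

$$\hat{f}(\mathbf{r}_{s_2}) = \hat{f}(\mathbf{a}) + \Delta_{22} \hat{f}_{x_2}(\mathbf{a}) + \frac{\Delta_{22}^2}{2} \hat{f}_{x_2 x_2}(\mathbf{a}) + O_p(\Delta_{22}^3),$$

and

$$\hat{f}_{x_1 x_1}(\mathbf{r}_{s_1}) = \hat{f}_{x_1 x_1}(\mathbf{a}) + O_p(\Delta_{11}^3 \Delta_{12}^3),$$

$$\hat{f}_{x_1 x_1}(\mathbf{r}_{s_2}) = \hat{f}_{x_1 x_1}(\mathbf{a}) + O_p(\Delta_{11}^3 \Delta_{12}^3).$$

Now, compute $\tilde{f}(\mathbf{a})$ by interpolating $\tilde{f}(\mathbf{r}_{s_1})$ and $\tilde{f}(\mathbf{r}_{s_2})$:

$$\tilde{f}(\mathbf{a}) = \hat{f}(\mathbf{a}) + \frac{1}{2} \left\{ \Delta_{21} \Delta_{22} \hat{f}_{x_2 x_2}(\mathbf{a}) + \Delta_{11} \Delta_{12} \hat{f}_{x_1 x_1}(\mathbf{a}) \right\} + \cdots.$$

Let $\Delta_m$ be the maximum distance between two consecutive resampling points in any dimension. Then the maximum error is observed when $\Delta_{11} = \Delta_{12} = \Delta_{21} = \Delta_{22} = \frac{\Delta_m}{2}$. Then

$$\tilde{f}(\mathbf{a}) = \hat{f}(\mathbf{a}) + \frac{\Delta_m^2}{8} \left\{ \hat{f}_{x_1 x_1}(\mathbf{a}) + \hat{f}_{x_2 x_2}(\mathbf{a}) \right\} + O_p(\Delta_m^3).$$

Note that the term $O_p(\Delta_m^3)$ also includes the terms with higher-order derivatives. When using the Epanechnikov kernel function, the second partial derivative is constant and the partial derivatives of higher order are zero.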
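The leading error term can be checked numerically on a surface whose second partial derivatives are constant, so that the higher-order terms vanish. The sketch below uses a hypothetical quadratic test surface (not a density, and not taken from the paper) and evaluates the bilinear interpolant at the center of a square cell of side $\Delta_m$, which is the worst case identified above.

```python
def bilinear_center(f, delta):
    """Bilinear interpolation of f at the center of the cell [0, delta]^2."""
    corners = [f(0.0, 0.0), f(delta, 0.0), f(0.0, delta), f(delta, delta)]
    # At the cell center all four interpolation weights equal 1/4
    return sum(corners) / 4.0

def f(x, y):
    # Test surface with constant second partials: f_x1x1 = 2, f_x2x2 = 6
    return x**2 + 3.0 * y**2

delta = 0.1
observed = bilinear_center(f, delta) - f(delta / 2.0, delta / 2.0)
predicted = delta**2 / 8.0 * (2.0 + 6.0)
```

Here `observed` and `predicted` agree up to floating-point rounding, which matches the $\frac{\Delta_m^2}{8}$ coefficient in the bound.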

4.4 Adaptive Resampling Model

From Section 4.3, we know that the accuracy of the linear interpolation depends on 1) the distance between two adjacent resampling points; and 2) the second derivative of the density function. To minimize the error while keeping the number of resampling points within a reasonable margin, we add more resampling points in the regions where the density function has high curvature, as shown in Fig. 2. By contrast, in the regions where the function is approximately linear, we use fewer resampling points.


In spatiotemporal data streams, the distribution is spatially non-uniform and dynamic. Therefore, high resolution with sufficient resampling points is required 1) in dense areas with high PDF values, to capture the details; and 2) in sensitive areas on the boundary between dense and sparse regions, to capture dynamic changes. Adaptive resampling meets this requirement because both dense areas and sensitive areas generally have density functions with high curvature. It detects high curvature using the second derivative of the density function, which is introduced in the next section.

4.5 Bandwidth Selection

Based on the discussion in Section 3.2, we use different bandwidth values for each dimension. The marginal distribution of the data is used to compute the bandwidth value on that dimension. From Eq. (5), the bandwidth value on dimension $i \in \{1, 2\}$ can be computed as

$$h_i = \left[ \frac{R(K)}{\mu_2^2(K)\, R(f''_i)\, n} \right]^{1/5}, \tag{8}$$

where $f''_i$ is the second derivative of the density function on the $i$th dimension. We start by estimating a pilot bandwidth using the normal rule $\tilde{h}_i = c\,\sigma_i n^{-1/5}$, where $c$ is a constant that depends on the kernel function $K$ and $\sigma_i$ is the standard deviation of the projection of the data on axis $i$. This pilot bandwidth is used to estimate the second derivative of the marginal distribution as

$$\hat{f}''_i(x) = \frac{1}{n \tilde{h}_i} \sum_{j=1}^{n} K''\left(\frac{x - x_j}{\tilde{h}_i}\right). \tag{9}$$

In this case, $\hat{f}''_i$ is a better approximation of $f''_i$ than the one obtained by considering $f_i$ to be a normal density. Using KDE-Track on the one-dimensional data speeds up the computation of $R(\hat{f}''_i)$ and $h_i$. Since the distribution of the data changes over time with the arrival of new samples from the stream, the bandwidth values $h_i$ should be updated accordingly to represent the variation of the data along the different dimensions. Using KDE-Track also allows the values of $h_i$ to be updated online and efficiently.

In this way, the estimated $\hat{f}''_i$ serves two roles. First, it is used to approximate $R(f''_i)$ to compute the bandwidth value. Second, it is used as a more accurate indicator of high curvature of the density function's curve, which facilitates the adaptive resampling in KDE-Track for obtaining a more accurate estimate, as we discuss in the following sections.

Algorithm 1. KDE-Track: Online Density Estimation

Parameters: $w$ (window size)
Online flow in: streaming data $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_t, \ldots\}$
Online output: $\tilde{f}(\mathbf{x}_t)$ (the PDF value at $\mathbf{x}_t$)
Procedure:
 1: # Initialization:
 2: $W = \{\mathbf{x}_1, \ldots, \mathbf{x}_w\}$, update_step $= 0.05w$ and $\mathcal{M} = \emptyset$
 3: Compute $h_{i1} = \left[ \frac{R(K)}{\mu_2^2(K) R(\hat{f}''_i) w} \right]^{1/5}$ (Section 4.5)
 4: for $k = 0$ to $U_1 - 1$ do (Section 4.1)
 5:     for $l = 0$ to $U_2 - 1$ do
 6:         Put $s = l U_1 + k$ and $\mathbf{m}_s = (u^1_k, u^2_l)$
 7:         Compute $\hat{f}(\mathbf{m}_s)$ using Eq. (3)
 8:         Put $M_s = (\mathbf{m}_s, \hat{f}(\mathbf{m}_s))$
 9:         $\mathcal{M} = \mathcal{M} \cup \{M_s\}$
10:     end for
11: end for
12: while a new sample $\mathbf{x}_t$ arrives in the stream do
13:     if ($1 \le t \le w$) then
14:         Compute $\tilde{f}(\mathbf{x}_t)$ using Eq. (7)
15:     else
16:         # Update the resampling model: (Section 5.3)
17:         Remove $\mathbf{x}_{t-w}$ from $W$ and add $\mathbf{x}_t$ to $W$
18:         for each dimension $i$ do
19:             Update $\sigma_{it}$ using $\mathbf{x}_t$, $\mathbf{x}_{t-w}$
20:             Compute $\tilde{h}_{it} = c\,\sigma_{it} w^{-1/5}$
21:             Update $\hat{f}''_i(u^i_k)$
22:         end for
23:         for each dimension $i$ do
24:             Compute $R(\hat{f}''_i)$, $h_{it} = \left[ \frac{R(K)}{\mu_2^2(K) R(\hat{f}''_i) w} \right]^{1/5}$
25:             $\forall s\ (0 \le s \le U_1 U_2 - 1)$, update $\hat{f}(\mathbf{m}_s)$ by Eq. (11)
                # Update the adaptive resampling:
26:             if mod($t$, update_step) $= 0$ then
27:                 Compute $\bar{f}''_i = \frac{1}{U_i} \sum_{j=0}^{U_i - 1} |\hat{f}''_i(u^i_j)|$
28:                 if $\max(|\hat{f}''_i(u^i_k)|, |\hat{f}''_i(u^i_{k-1})|) > \bar{f}''_i$ then
29:                     $u^i_{temp} = (u^i_k + u^i_{k-1})/2$
30:                     Compute $\hat{f}''_i(u^i_{temp})$
31:                     Insert $u^i_{temp}$ in $\mathcal{U}_i$
32:                 end if
33:                 $Q_i = \max(|\hat{f}''_i(u^i_{l-1})|, |\hat{f}''_i(u^i_l)|, |\hat{f}''_i(u^i_{l+1})|)$
34:                 if $Q_i < 0.05 \bar{f}''_i$ then
35:                     Merge $[u^i_{l-1}, u^i_l]$ and $[u^i_l, u^i_{l+1}]$
36:                 end if
37:             end if
38:         end for
39:         Compute $\tilde{f}(\mathbf{x}_t)$ using Eq. (7)
40:     end if
41: end while

5 KDE-TRACK IMPLEMENTATION

Estimating the density for each incoming data sample using KDE-Track requires access to only four resampling points, as discussed in Section 4. The key step is thus the maintenance of the resampling model (resampling points and their PDF values). Algorithm 1 shows how KDE-Track's model is maintained and used for online density estimation. The lines starting with the # sign are comments.

Fig. 2. Adaptive resampling: More resampling points are used in regions with high curvature of the function.

5.1 Initializing the Resampling Model

The resampling model is initialized using the beginning part of the streaming data, e.g., the first 5,000 points.² The resampling points, $\mathbf{m}_s$, $s = 1, \ldots, q$, are defined as the Cartesian product of the sets of equidistant points selected on each dimension within the range of the initial points received so far.³ The second derivative of the marginal density on each dimension is then estimated at the initial resampling points and used for selecting the bandwidth value. Moreover, the second derivative is used to add more resampling points in the regions with high curvature of the density. Using the estimated bandwidth value, the density values $\hat{f}(\mathbf{m}_s)$ of these resampling points are computed using the traditional KDE on the initial batch of points.

5.2 Estimation Based on the Resampling Model

Once all $M_i$ have been initialized, we can estimate the density at each arriving point $\mathbf{x}$. With our interpolation method, the density at $\mathbf{x}$ is estimated by 1) calculating, on each dimension, the index of the resampling point that, together with its successive neighbor, will contribute; 2) fetching the four resampling points in the cell surrounding the point $\mathbf{x}$ and their densities; and 3) computing $\tilde{f}(\mathbf{x})$ using Eq. (7).
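The three steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes Eq. (7) is the standard bilinear interpolation over the four corners of the enclosing cell; the names `cell_index` and `interpolate_density`, and the nested-list layout of the stored densities, are hypothetical and not taken from the paper. Points outside the grid are assigned to the nearest boundary cell, which amounts to linear extrapolation.

```python
from bisect import bisect_right


def cell_index(grid, x):
    """Index j with grid[j] <= x <= grid[j + 1], clamped to the first/last cell."""
    j = bisect_right(grid, x) - 1
    return min(max(j, 0), len(grid) - 2)


def interpolate_density(u1, u2, f, x):
    """Bilinear estimate of the density at x = (x1, x2).

    u1, u2: sorted resampling coordinates on each axis (need not be equidistant).
    f[a][b]: stored density at the resampling point (u1[a], u2[b]).
    """
    # Step 1: locate the contributing index on each dimension.
    a, b = cell_index(u1, x[0]), cell_index(u2, x[1])
    # Relative position of x inside its cell on each axis.
    s = (x[0] - u1[a]) / (u1[a + 1] - u1[a])
    t = (x[1] - u2[b]) / (u2[b + 1] - u2[b])
    # Steps 2 and 3: fetch the four corner densities and blend them.
    return ((1 - s) * (1 - t) * f[a][b] + s * (1 - t) * f[a + 1][b]
            + (1 - s) * t * f[a][b + 1] + s * t * f[a + 1][b + 1])
```

Only the four corner values are read per query, so the cost of a lookup does not depend on how many stream samples have been processed.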

5.3 Updating the Resampling Model

The resampling model is the basis of our density estimator. The resampling points and their PDF values should be updated after receiving a new data point. As discussed in Section 4.4, the resampling points are adaptively maintained according to the curvature of the density function. An interval $[u^i_j, u^i_{j+1}]$ is divided into two equal subintervals when $\max\{|f''_i(u^i_j)|, |f''_i(u^i_{j+1})|\} > \bar{f}''_i$, where $\bar{f}''_i = \frac{1}{U_i}\sum_{j=0}^{U_i-1} |f''_i(u^i_j)|$ is the average of the absolute values of the second derivative. In this way, more resampling points are inserted in areas that lie on the boundary between dense and sparse regions or that are dense with high peaks in the density function. Two intervals $[u^i_{l-1}, u^i_l]$ and $[u^i_l, u^i_{l+1}]$ are merged to reduce the number of resampling points when the density function is close to linear, which means that $|f''_i(u^i_{l-1})|$, $|f''_i(u^i_l)|$ and $|f''_i(u^i_{l+1})|$ are close to zero (less than $0.05\,\bar{f}''_i$). In sparse regions, the PDF values are close to zero and the function is almost linear, so the intervals are also merged to reduce the number of resampling points. Note that once a given interval has been split, it will not be split again until it gets merged with another interval. The same condition is applied for merging the intervals.
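The split and merge rules can be illustrated on a single dimension. The sketch below is a simplified reading of the rules, not the authors' implementation: `split_pass`, `merge_pass`, and `normal_d2` are hypothetical names, the second derivative is supplied as a callable (here the analytic second derivative of a standard normal density, purely as a stand-in for the estimated $f''_i$), and the bookkeeping that prevents an interval from being split twice before it is merged is omitted.

```python
import math


def normal_d2(x):
    """Second derivative of the standard normal density (example curvature only)."""
    return (x * x - 1.0) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def split_pass(u, d2):
    """Insert a midpoint into every interval whose endpoint curvature exceeds the average."""
    curv = [abs(d2(x)) for x in u]
    avg = sum(curv) / len(curv)  # average absolute second derivative
    out = []
    for j in range(len(u) - 1):
        out.append(u[j])
        if max(curv[j], curv[j + 1]) > avg:
            out.append(0.5 * (u[j] + u[j + 1]))
    out.append(u[-1])
    return out


def merge_pass(u, d2, ratio=0.05):
    """Drop an interior point when it and both neighbours have near-zero curvature."""
    curv = [abs(d2(x)) for x in u]
    avg = sum(curv) / len(curv)
    keep = [True] * len(u)
    l = 1
    while l < len(u) - 1:
        if max(curv[l - 1], curv[l], curv[l + 1]) < ratio * avg:
            keep[l] = False
            l += 2  # skip the right neighbour so merges do not chain in one pass
        else:
            l += 1
    return [x for x, k in zip(u, keep) if k]
```

With the example curvature, a split pass refines the grid around the peak of the density, while a merge pass on a wider grid thins out the flat tails.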

When updating the densities of resampling points in the model $M$, we should consider the evolution of the data distribution. We use a sliding window strategy to capture the evolution over time. The window size $w$ is an application-dependent parameter and can be set based on the arrival rate of the data samples and the time interval during which we need to estimate the dynamic density. The window size also controls the robustness of KDE-Track against noisy data: an isolated outlier increases the height of the PDF curve by at most $1/w$, which will not be noticed. However, when a new pattern arrives, the new points replace points from the old pattern in the sliding window, and their contribution is observed in the shape of the density function after a reasonable number of data points from the new pattern have been received. Let $n_t$ denote the number of points we have received until time $t$. Depending on the relation between $w$ and $n_t$, there are two different scenarios when updating the model $M$, more specifically, updating the bandwidths and the density $f(\mathbf{m}_s)$:

(i) When $n_t \le w$. The received points cannot fill the whole window. The pilot bandwidth value at time $t$ is calculated using all $n_t$ points by the formula $\tilde{h}_{it} = c\,\sigma_{it}\,n_t^{-1/5}$, where $\sigma_{it}^2$ is the sample variance of the received data samples projected on the $i$th axis, calculated as

$$\sigma_{it}^2 = \frac{1}{n_t - 1}\left\{\sum_{j=1}^{n_t} x_{ij}^2 - \frac{1}{n_t}\Big(\sum_{j=1}^{n_t} x_{ij}\Big)^2\right\}, \quad i \in \{1, 2\} \ [39],$$

which can be updated in constant time at each $t$. The pilot bandwidth is used to update the estimation of the second derivative of the marginal distribution of the data on dimension $i$. The roughness $R(f''_i)$ is then computed to estimate the bandwidth value on that dimension.

After receiving a point $\mathbf{x}_t$, the density at a resampling point $\mathbf{m}_s$ at time $t$ is updated using the sample-point estimator [40]

$$f_t(\mathbf{m}_s) = \frac{n_t - 1}{n_t}\, f_{t-1}(\mathbf{m}_s) + \frac{1}{n_t h_{1t} h_{2t}}\, K_h(\mathbf{m}_s, \mathbf{x}_t), \tag{10}$$

where $K_h$ is defined in Eq. (3). It is straightforward to show that the updated density $f_t(\mathbf{m}_s)$ is a good approximation to the estimated density using all the $n_t$ points. In particular, $f_t(\mathbf{x}) = \frac{1}{n_t}\sum_{j=1}^{t} \frac{1}{h_{1j} h_{2j}} K_{h_j}(\mathbf{x}, \mathbf{x}_j) \ge 0$ and, for all $j$, the integral over $\mathbf{x}$ of $\frac{1}{h_{1j} h_{2j}} K_{h_j}(\mathbf{x}, \mathbf{x}_j)$ equals 1, so averaging the integrals of the two terms in (10) results in 1.

(ii) When $n_t > w$. In this case, the pilot bandwidth is calculated on the most recently received $w$ points inside the window as follows: $\tilde{h}_{it} = c\,\sigma_{it}\,w^{-1/5}$, where the sample variance $\sigma_{it}^2$ of the projected data on the $i$th dimension can be easily updated by

$$\sigma_{it}^2 = \frac{1}{w - 1}\left(\sum_{j=t-w+1}^{t} x_{ij}^2 - \frac{1}{w}\Big(\sum_{j=t-w+1}^{t} x_{ij}\Big)^2\right).$$

The pilot bandwidth is used to update the estimation of the second derivative values and to compute the bandwidth $h_{it}$, which is used to update the density function at the resampling points. The PDF values at the resampling points $f_t(\mathbf{m}_s)$ are updated by absorbing the newly arrived point $\mathbf{x}_t$ and deleting the old point that has moved out of the window:

$$f_t(\mathbf{m}_s) = f_{t-1}(\mathbf{m}_s) + \frac{K_{h_t}(\mathbf{m}_s, \mathbf{x}_t)}{w\, h_{1t} h_{2t}} - \frac{K_{h_{t-w}}(\mathbf{m}_s, \mathbf{x}_{t-w})}{w\, h_{1,t-w} h_{2,t-w}}. \tag{11}$$

The probabilistic properties of the updated density function $f_t(\mathbf{x})$ can be proved as follows:

(1) $f_t(\mathbf{x}) \ge 0, \ \forall\, \mathbf{x}$, because $f_t(\mathbf{x}) = \frac{1}{w}\sum_{j=t-w+1}^{t} \frac{1}{h_{1j} h_{2j}} K_{h_j}(\mathbf{x}, \mathbf{x}_j)$ is a summation of nonnegative terms;

(2) $\int_{-\infty}^{\infty} f_t(\mathbf{x})\, dx_1\, dx_2 = 1$. Since the integral of $(h_{1j} h_{2j})^{-1} K_{h_j}(\mathbf{x}, \mathbf{x}_j)$ equals 1 for any $j$, the average of $w$ such terms also integrates to 1.

When updating the resampling model, we should consider the changes in the high-curvature regions of the PDF, as the density function changes over time. Furthermore, when updating the model, we consider extending and shrinking the resampling model to cover the support of the density function. When data from a new distribution arrive, we extend the resampling model to cover the range of the data. As a result of the evolution in data streams, distributions might disappear from the stream, so the resampling points in regions where no data samples have been received for a long time interval should be removed. As changes in the curved regions of the density function cannot be observed after receiving a single data sample, we check for such changes after receiving a predefined number of points, e.g., 5–20% of the sliding window size.

2. This first batch of data is used for initializing the resampling points and setting bandwidth values. KDE-Track is not sensitive to how many points are used in this batch, as the resampling model and bandwidth are updated online with newly arriving data after initialization.

3. The initial model is defined such that the distance between any two consecutive points on the $x_i$-axis is $\tilde{h}_{i1}$, where $\tilde{h}_{i1}$ is the pilot bandwidth estimated using the first batch of data points.
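The two update rules, Eqs. (10) and (11), together with the constant-time variance update behind the pilot bandwidth, can be sketched as follows. This is an illustrative reading under stated assumptions rather than the authors' code: the kernel is taken to be a product of standard normal densities with the $1/(h_1 h_2)$ factor applied explicitly, the bandwidths are supplied by the caller instead of being derived from the roughness $R(f''_i)$, the constant `c` is left as a parameter, and the class and function names are hypothetical.

```python
import math
from collections import deque


def gauss(z):
    """Standard normal density, used here as the 1D kernel K (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kernel_term(m, x, h):
    """K_h(m, x) / (h1 * h2), with K_h taken as an unnormalized Gaussian product."""
    return gauss((m[0] - x[0]) / h[0]) * gauss((m[1] - x[1]) / h[1]) / (h[0] * h[1])


class SlidingVariance:
    """Running sums that give the sample variance of one coordinate in O(1)."""

    def __init__(self):
        self.n, self.s1, self.s2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def remove(self, x):
        self.n -= 1
        self.s1 -= x
        self.s2 -= x * x

    def pilot_bandwidth(self, c):
        """c * sigma * n^(-1/5); needs at least two samples."""
        var = (self.s2 - self.s1 * self.s1 / self.n) / (self.n - 1)
        return c * math.sqrt(max(var, 0.0)) * self.n ** (-0.2)


class WindowedDensity:
    """Density values at fixed 2D resampling points over a window of size w."""

    def __init__(self, points, w):
        self.points = list(points)
        self.f = [0.0] * len(self.points)
        self.w = w
        self.window = deque()  # (x_j, h_j) pairs, oldest first

    def update(self, x, h):
        n = len(self.window) + 1
        if n <= self.w:
            # Eq. (10): weighted average while the window is still filling.
            for s, m in enumerate(self.points):
                self.f[s] = (n - 1) / n * self.f[s] + kernel_term(m, x, h) / n
        else:
            # Eq. (11): add the new contribution and subtract the expired one,
            # using the bandwidths that were in force when it was added.
            x_old, h_old = self.window.popleft()
            for s, m in enumerate(self.points):
                self.f[s] += (kernel_term(m, x, h)
                              - kernel_term(m, x_old, h_old)) / self.w
        self.window.append((x, h))
```

Storing each sample together with its bandwidths is one way to make the subtraction in Eq. (11) exact; each update touches every resampling point once, which matches the per-sample cost linear in the model size.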

5.4 Time and Space Complexity Analysis

Based on the discussion earlier in this section, the time complexity of estimating the density for a new incoming data point is $O(U_1 + U_2)$, where $U_1$ and $U_2$ were given in Section 4.1. They are independent of the number of points that have been received from the data stream. Updating the model when receiving a new point requires computing time linear in the total number of resampling points $|M|$, since all the function values at the resampling points are updated. The overall time complexity of processing each arriving point is therefore linear in the model size, which is usually small. The time required for online density estimation of a data stream with $n$ points is $O(n|M|)$, which increases linearly with the number of points received from the stream.

Bandwidth selection requires maintaining a one-dimensional KDE-Track model on each dimension. Since $U_1 \times U_2 = |M|$ and $U_1, U_2 \ge 1$, the total number of resampling points in both one-dimensional KDE-Track models is $U_1 + U_2 \le |M| + 1$ (because $(U_1 - 1)(U_2 - 1) \ge 0$), which only increases the constant in KDE-Track’s time complexity formula. Thus, the time complexity of KDE-Track is $O(n|M|) = O(n \times U_1 \times U_2)$.

During the online density estimation process, KDE-Track keeps the resampling model $M$ and the points in the sliding window in memory. Therefore, the memory usage is $|M|w = U_1 \times U_2 \times w$. Note that the model size $|M|$ changes with the distribution variation in the data stream due to the merge/split operations in adaptive resampling.

5.5 Multidimensions

Extending the two-dimensional KDE-Track to higher dimensions is straightforward. The same technique can be used for selecting the bandwidth using the marginal distribution of the data on each dimension. The KDE-Track model for estimating the density of $d$-dimensional data can be constructed as follows: 1) discretize the range of the data on the $i$th dimension, with $1 \le i \le d$, using a set of points $U_i$; 2) define the set of resampling points as the Cartesian product $U_1 \times U_2 \times \cdots \times U_d$; and 3) estimate the density function values at the set of resampling points and store them with their estimated density in the model $M$.
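The three construction steps can be sketched with the standard library. In this hedged example, `axis_points` and `build_model` are hypothetical helper names, the per-axis points are equidistant for simplicity, and `density` is any callable that returns the estimated density at a point (for instance, a traditional KDE over the initial batch); none of these names come from the paper.

```python
from itertools import product


def axis_points(lo, hi, k):
    """k equidistant points covering [lo, hi] (k >= 2)."""
    step = (hi - lo) / (k - 1)
    return [lo + j * step for j in range(k)]


def build_model(axes, density):
    """Map every point of the Cartesian product of the axes to its density value.

    axes: one list of coordinates per dimension (the sets U_1, ..., U_d).
    density: callable taking a d-tuple and returning the estimated density there.
    """
    return {m: density(m) for m in product(*axes)}
```

The dictionary holds one entry per element of the Cartesian product, so its size is the product of the per-axis counts, which is why the model grows quickly with the number of dimensions.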

The product kernel defined in Eq. (3) becomes

$$K_h(\mathbf{x}, \mathbf{x}_j) = \prod_{i=1}^{d}\left\{\frac{1}{h_i}\, K\!\left(\frac{x_i - x_{ji}}{h_i}\right)\right\}.$$

Researchers often avoid the product kernel for high-dimensional data and replace it with an orientation-invariant kernel function

$$K_h(\mathbf{x}, \mathbf{x}_j) = \frac{1}{h}\, K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_j\|}{h}\right),$$

which may not be able to estimate densities with arbitrary shapes, as it assumes equal variance of the data on each dimension.
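The difference between the two kernels can be seen numerically. The sketch below assumes a Gaussian profile for $K$ and transcribes both formulas as written above, including the $1/h$ factor of the orientation-invariant form; the function names are hypothetical.

```python
import math


def gauss(z):
    """Standard normal density, used as the kernel profile K (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def product_kernel(x, xj, h):
    """prod_i (1/h_i) K((x_i - x_ji) / h_i) with one bandwidth per dimension."""
    val = 1.0
    for xi, xji, hi in zip(x, xj, h):
        val *= gauss((xi - xji) / hi) / hi
    return val


def radial_kernel(x, xj, h):
    """(1/h) K(||x - x_j|| / h) with a single scalar bandwidth, as written in the text."""
    return gauss(math.dist(x, xj) / h) / h
```

Two points at the same Euclidean distance from $\mathbf{x}_j$ always receive the same weight from `radial_kernel`, whereas `product_kernel` with unequal bandwidths weights them differently, which is what lets it follow data with different spreads along each axis.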

The interpolation error for the case of $d$-dimensional data can be bounded as follows: let $\Delta_m^i$ be the maximum distance between the resampling points in dimension $i$, $i \in \{1, 2, \ldots, d\}$, and $\Delta_m = \max\{\Delta_m^i, 1 \le i \le d\}$. Then the error can be written as

$$\tilde{f}(\mathbf{a}) = f(\mathbf{a}) + \frac{\Delta_m^2}{8}\left\{\sum_{i=1}^{d} f_{x_i x_i}(\mathbf{a})\right\} + O_p(\Delta_m^3).$$

This error can be reduced by including more resampling points in certain regions.
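The $\Delta_m^2/8$ scaling can be checked numerically in one dimension. The sketch below does not reproduce the $d$-dimensional expansion above; it uses the classical bound for piecewise-linear interpolation, $|f - \tilde{f}| \le \frac{\Delta^2}{8}\max|f''|$, on a standard normal density, chosen here only because its second derivative is known in closed form. The function names are hypothetical.

```python
import math


def phi(x):
    """Standard normal density (test function for the check)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def max_interp_error(delta, lo=-4.0, hi=4.0, probes=20):
    """Largest |f - linear interpolant| found on a uniform grid with spacing delta."""
    k = round((hi - lo) / delta)
    worst = 0.0
    for j in range(k):
        a = lo + j * delta
        b = a + delta
        for p in range(probes + 1):
            t = p / probes
            x = a + t * delta
            approx = (1 - t) * phi(a) + t * phi(b)
            worst = max(worst, abs(phi(x) - approx))
    return worst


def classical_bound(delta):
    """delta^2 / 8 times max |f''|, which equals phi(0) for the standard normal."""
    return delta * delta / 8.0 * phi(0.0)
```

Halving the grid spacing reduces the observed error by roughly a factor of four, which is the behaviour that motivates adding resampling points where the curvature is high.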

KDE-Track’s time complexity remains linear in the size of the stream, $O(n|M|)$. However, $|M| = |U_1| \times |U_2| \times \cdots \times |U_d|$ increases with the number of dimensions $d$ and the number of resampling points $|U_i|$ on each dimension $i$.

6 PERFORMANCE EVALUATION

In order to evaluate our method, we ran an extensive number of experiments on synthetic data. Here we report the results for one 1D stream and one 2D stream. Since the true densities are known, we evaluate the performance of the density estimators by comparing the estimated densities with the true ones. We compare KDE-Track with a set of baseline methods in terms of estimation accuracy and running time. The accuracy of the different methods is measured by the Mean Absolute Error (MAE) and the $l_\infty$ error.

6.1 Estimation Accuracy on Synthetic Data

6.1.1 Datasets

The one-dimensional stream (S1D) was generated by drawing data samples from the fifteen densities suggested by Marron and Wand [41] and presented in Fig. 3. The stream is constructed by drawing $3 \times 10^4$ data samples from each density and concatenating the batches to get one stream of $4.5 \times 10^5$ data samples. The two-dimensional data stream (S2D) is generated by drawing data segments of size $10^5$ from the seven densities presented in Fig. 4. The total size of the stream is $7 \times 10^5$ data samples. These streams are selected because they contain challenging densities that are hard to estimate accurately. Batches of the same size are used only to simplify the calculation of the true density.

6.1.2 Bandwidth Selection

The first experiment evaluates the proposed bandwidth selection method. We compare the accuracy of KDE-Track when using different bandwidth selection methods to estimate the density of S1D. The baseline methods include 1) the normal rule, which estimates the bandwidth by assuming the underlying distribution is normal; 2) the analytical bandwidth, which is computed by Eq. (5) with $R(f'')$ analytically derived from the true density (known in synthetic data); and 3) Shimazaki’s method [34]. Note that the Sheather-Jones method [29] is not compared due to its quadratic complexity w.r.t. the window size; we had to stop it after one month of running without any results.

The different bandwidth selection methods are evaluated in terms of the MAE and the running time. The MAE, for a given window size, is computed as follows: i) define a set of evaluation checkpoints $C = \{c_1, c_2, \ldots\}$, where consecutive checkpoints $c_j, c_{j+1}$ are separated by the arrival of 1,000 samples from the stream; ii) at each checkpoint $c_j$, generate an evaluation set $E = \{e_1, e_2, \ldots, e_{1,000}\}$ of 1,000 data samples from the same distribution as the data in the sliding window; iii) compute $\mathrm{MAE}_{c_j}$ by averaging the absolute difference between the true and the estimated density values, $\frac{1}{1{,}000}\sum_{k=1}^{1{,}000} |f(e_k) - \tilde{f}(e_k)|$; iv) compute $\mathrm{MAE}_w = \frac{1}{|C|}\sum_{j=1}^{|C|} \mathrm{MAE}_{c_j}$, where $|C| = n/1{,}000$ is the number of checkpoints, $n$ is the size of the stream and $w$ is the window size; v) repeat the experiment for 20 instances of S1D and report the average of $\mathrm{MAE}_w$ in Fig. 5. The window size changes from $1 \times 10^4$ to $3 \times 10^5$. Note that, at every checkpoint $c_j$, the set of evaluation points differs from the set of evaluation points at checkpoint $c_{j-1}$, and all the methods are evaluated using the same set of evaluation points.

Fig. 3. The 15 densities recommended by Marron and Wand [41] to evaluate univariate density estimators.

Fig. 4. The contours of the densities used to construct the 2D data stream that is used to evaluate the density estimators.
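Steps i) to iv) of the MAE protocol can be written as a short routine. This is a sketch under stated assumptions: the three callbacks (`sample_eval_set`, `true_density`, `est_density`) are hypothetical stand-ins for drawing the evaluation set and querying the true and estimated densities at a checkpoint, and the function names are not from the paper.

```python
def mae(true_vals, est_vals):
    """Mean absolute difference between two equally long sequences of density values."""
    return sum(abs(f - g) for f, g in zip(true_vals, est_vals)) / len(true_vals)


def mae_over_checkpoints(stream_len, step, sample_eval_set, true_density, est_density):
    """Average the per-checkpoint MAE, with one checkpoint every `step` samples.

    sample_eval_set(c): evaluation points for the checkpoint reached after c samples.
    true_density(c, e), est_density(c, e): density values at point e at that checkpoint.
    """
    scores = []
    for c in range(step, stream_len + 1, step):
        evals = sample_eval_set(c)
        scores.append(mae([true_density(c, e) for e in evals],
                          [est_density(c, e) for e in evals]))
    return sum(scores) / len(scores)
```

Passing the same `sample_eval_set` output to every estimator at a given checkpoint is what keeps the comparison between methods fair, as the text requires.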

First, we see that the estimation error decreases with increased window size $w$, because more data are included in the estimation. However, the MAE does not keep decreasing noticeably, because a large window contains more densities that are difficult to estimate. Second, the estimation error of our proposed method is always comparable with that of Shimazaki’s and the analytical method, but our method is much more efficient. When $w$ increases, our method remains efficient, while Shimazaki’s and the analytical method take more time due to the heavy calculations for approximating $R(f'')$. The normal rule is the most efficient but has a much higher error.

6.1.3 Estimation Accuracy

The estimation accuracy of KDE-Track is compared with the accuracy of four baseline methods (CK is excluded for the S2D data because it was proposed to estimate the density of univariate data only). The baseline methods are: 1) the traditional KDE [14] defined in Eq. (1); 2) the FFT-KDE [13], [14], which deploys FFT to convolve a very fine histogram of the data with a kernel function to produce a continuous density function; 3) the Cluster Kernels (CK) [7], which maintains a specific number of kernels by merging similar kernels; and 4) SOMKE [9], which employs SOM to cluster the data into a specific number of clusters and uses the centroids of the clusters as the set of kernels.

Selecting the bandwidth values for each estimator is done using the same settings as in the references. All the baseline methods use the normal rule because of its efficiency, except the CK method, which uses the Epanechnikov kernel function with a recommended constant $c = 1.06$. This setting enables CK to perform well when densities have high peaks and are multimodal. KDE-Track uses our proposed method for setting the bandwidths, i.e., estimating the roughness of the second derivative $R(f'')$ and plugging it into Eq. (5), which increases its accuracy significantly.

The performance of all the estimators is evaluated by the MAE and the $l_\infty$ error. The MAE measures how well the estimated density curve fits the curve of the true density, while the $l_\infty$ error measures the maximum deviation between the true and the estimated curves. The error is computed by defining a set of checkpoints with a step of 1,000. For each estimation method, at each checkpoint an evaluation set $E = \{\mathbf{e}_1, \ldots, \mathbf{e}_{1,000}\}$ of 1,000 samples is generated from the same distribution as the data in the sliding window. The true and the estimated density values of the evaluation points are then compared to compute the MAE and the $l_\infty$ error.
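The two metrics respond differently to the same total amount of error, which a small example makes concrete. The function names below are hypothetical and the density values are made up for illustration only; they are not results from the paper.

```python
def mean_abs_error(true_vals, est_vals):
    """Average absolute deviation over the evaluation points."""
    return sum(abs(f - g) for f, g in zip(true_vals, est_vals)) / len(true_vals)


def linf_error(true_vals, est_vals):
    """Largest absolute deviation over the evaluation points."""
    return max(abs(f - g) for f, g in zip(true_vals, est_vals))


# Illustrative values: both estimates have the same average error, but the
# second concentrates it at a single evaluation point.
true_vals = [0.2, 0.2, 0.2, 0.2]
spread_out = [0.25, 0.25, 0.25, 0.25]
one_spike = [0.2, 0.2, 0.2, 0.4]
```

Here `mean_abs_error` cannot tell the two estimates apart, while `linf_error` is four times larger for `one_spike`, so reporting both metrics separates a uniformly mediocre fit from a good fit with one badly missed peak.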

The CK and FFT-KDE methods are not designed to capture the dynamic density of data streams using the sliding window approach. To adapt these methods to sliding windows, we rebuild their model at each evaluation checkpoint by deleting the old model and creating a new model using the data samples in the current window. This adaptation preserves the estimation accuracy of the methods. However, the CK method is shown to be impractical for online density estimation due to its high computational cost, as we will show later in Section 6.2. SOMKE is adapted to the sliding window by dividing the sliding window into batches of 1,000 samples. At each evaluation checkpoint, the kernels that represent the batch removed from the sliding window are deleted and replaced by the kernels that represent the most recent batch added to the sliding window.

Fig. 5. (a) The MAE and (b) the running time of different bandwidth selection methods with various window sizes.

Fig. 6. (a) The MAE and (b) the $l_\infty$ error incurred by the different density estimators when estimating the density of the S1D stream. The window size is $2 \times 10^4$.

Figs. 6 and 7 show the MAE and the $l_\infty$ error incurred by the evaluated methods when estimating the density of the S1D and S2D streams, respectively. The window size is set to $2 \times 10^4$ data samples. The results show that KDE-Track has the best performance (the smallest error). The high accuracy obtained by KDE-Track is mainly due to our accurate bandwidth selection method. KDE, FFT-KDE and SOMKE show comparable results as they use the same bandwidth selection method.

Fig. 8 shows the estimation error in terms of MAE for the different estimators when estimating the densities of S1D and S2D with different sliding windows. The sliding window’s size changes from $1 \times 10^4$ to $3 \times 10^5$. For large windows, the density estimation becomes more accurate, which is reflected by smaller MAE values. However, the decrease in the MAE is smaller than expected because larger windows include data from different densities, which complicates the density estimation. As KDE-Track, CK, SOMKE and FFT-KDE are approximations of the KDE, they would be expected to perform comparably to KDE at best; however, the estimation accuracy depends on the bandwidth selection method. KDE-Track is shown to have the most accurate results. KDE, FFT-KDE and SOMKE have comparable results. In addition to the MAE, Fig. 8 shows the standard deviation for the sensitivity analysis of the window size, where KDE-Track is the most accurate (with the lowest error) and the most stable (with the smallest standard deviation in error), especially for the S2D stream.

We also evaluate the number of merges/splits in our resampling model, which contributes to the reduction of the estimation error and running time. Table 2 reports the number of intervals that were merged/split in the two streams S1D and S2D when the window size $w$ changes. The frequency of merges/splits depends on the dynamics of the density. When $w$ is small, the changes in the density are more observable and cause more updates of intervals. When $w$ is larger, more complex densities are expected but they change slowly, and thus require fewer updates of intervals. In our S1D and S2D streams, the densities become more complex with time, as shown in Figs. 3 and 4, and thus require more splits than merges, especially for large windows.

Fig. 7. The MAE (a) and the $l_\infty$ error (b) incurred by the different density estimators when estimating the density of the S2D stream. The window size is $2 \times 10^4$.

Fig. 8. The MAE incurred by the different density estimators when estimating the density of the S1D stream (a) and S2D stream (b). The window size varies from $1 \times 10^4$ to $3 \times 10^5$.


6.2 Computational Time Cost and Space Usage

Other important factors in the success of an online density estimator are its running time and space usage, as streaming data arrive fast and have unlimited size. Since we are estimating the dynamic density, which is better represented by the most recent samples, all the methods are modified to use the sliding window technique. This technique requires storing the samples in the sliding window in memory, either for using them to estimate the density as in KDE or for updating the density estimator’s model as in CK, FFT-KDE, SOMKE and KDE-Track. Hence, all the methods have comparable space complexity, which is linear in the window size.

The time complexity of KDE-Track, as discussed in Section 5.4, is $O(n|M|)$. Estimating the density using KDE at any given data sample requires scanning the sliding window. Therefore, the time complexity of KDE when used for online density estimation is $O(nw)$, where $w$ is the window size. The time complexity of CK is controlled by two main steps: 1) model reconstruction, which is performed at each evaluation checkpoint and has a complexity of $O(w)$; and 2) density estimation at any sample of the evaluation points, which has a constant time complexity. Thus, using CK for online density estimation has a complexity of $O(nw)$. However, the model reconstruction step of CK is more expensive than the density estimation using all the kernels in KDE. CK is expected to be more time-efficient if the data are stationary and the model is updated online without reconstruction.
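The $O(nw)$ cost of the traditional KDE in the sliding-window setting comes from the scan in every query, as the following one-dimensional sketch shows. It assumes a Gaussian kernel and a fixed bandwidth supplied by the caller; the class name `SlidingKDE` is hypothetical and this is not the baseline implementation used in the experiments.

```python
import math
from collections import deque


class SlidingKDE:
    """Traditional 1D Gaussian KDE over the most recent w samples."""

    def __init__(self, w, h):
        self.window = deque(maxlen=w)  # the oldest sample is dropped automatically
        self.h = h

    def add(self, x):
        self.window.append(x)

    def density(self, x):
        """Sum one kernel per stored sample: O(w) work for every query."""
        norm = len(self.window) * self.h * math.sqrt(2.0 * math.pi)
        return sum(math.exp(-0.5 * ((x - xj) / self.h) ** 2)
                   for xj in self.window) / norm
```

Querying the density once per arriving sample therefore costs $w$ kernel evaluations each time, in contrast to the four-point lookup that KDE-Track performs on its resampling model.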

FFT-KDE also has two main steps: 1) model reconstruction, which involves updating the histogram after receiving a new data sample and convolving the histogram with the kernel function; and 2) density estimation of the evaluation samples. The first step requires $O(B \log B)$, where $B$ is the number of bins in the histogram, and the second step has a constant time complexity. The time complexity is thus $O(nB \log B)$.

The SOMKE model is built by training the SOM neurons with the current window, which has a time complexity of $O(w)$. Estimating the density at the evaluation samples using the trained SOM neurons has a constant time complexity. The method’s time complexity is then $O(nw)$, where the constant in the complexity formula is smaller than that for KDE and CK. Fig. 9 shows the running time of the density estimators for online density estimation of the S1D and S2D streams. The results in the figure confirm our analysis (footnote 4). KDE-Track and FFT-KDE are the most efficient, with very small running times that are not affected by the size of $w$.

Fig. 9. The running time of the different density estimators when estimating the density of the (a) S1D and (b) S2D streams. The window size varies from $1 \times 10^4$ to $3 \times 10^5$.

TABLE 2
The Total Number of Intervals Merged/Split When Estimating the Density of S1D and S2D Streams with Different Window Sizes

| Window Size | 10K | 20K | 50K | 100K | 200K | 300K |
|---|---|---|---|---|---|---|
| M (S1D) | 8 | 6 | 2 | 0 | 0 | 0 |
| S (S1D) | 51 | 49 | 45 | 39 | 38 | 38 |
| M (S2D) | 5 | 4 | 0 | 0 | 0 | 0 |
| S (S2D) | 47 | 45 | 45 | 44 | 42 | 42 |

(M = Merge, S = Split).

Fig. 10. The density estimated using the New York Taxi trips data for different time intervals.

6.3 New York Taxi Trips Data Example

One of the main advantages of using KDE-Track for density estimation is the availability of the density function values at the set of resampling points at any time point. This can be used for visualizing the density function in real time without any further processing. Density-estimation-based visualization is preferable to scatterplots, which are among the most prominent success stories in statistics and visualization, since scatterplots are challenged by overdraw and cluttering in the case of large datasets [42]. Applications such as change diagnosis of data streams [43] will benefit from using KDE-Track by visualizing the density velocity upon the arrival of a new sample from the stream instead of using disjoint time intervals, which may miss critical changes that occur at the end of the different intervals. Service planners can also benefit from monitoring the density by forwarding more service providers to regions that demand more services at a specific time. For example, monitoring the density of taxi pickup data can tell the planners of taxi companies to forward more taxicabs to a specific region of the city.

In this section, we visualize the dynamic traffic distribution in the New York Taxi trips dataset (footnote 5). The dataset is freely available and contains records of trips that include the pickup time, the longitude and latitude of the pickup and drop-off locations, etc. We are mainly interested in the pickup time and location. Fig. 10 shows the density estimated using the pickup location with a window size of $10^4$ data points, where the data records are sorted according to their pickup time. The first three figures show the pickup events occurring in the early morning of a weekend day (Fig. 10a), of a regular working day (Fig. 10b) and of a national holiday (Fig. 10c). These figures show more pickup events during weekends and holidays than during regular working days in Greenwich Village and the East Village, where there are many restaurants and nightclubs. The frequency of pickup events also increases during weekends, as it took less than 30 minutes to record $10^4$ events on a weekend but more than 3 hours in the early morning of a working day. More taxicabs are thus suggested in that region on similar occasions to satisfy the high demand.

Interesting patterns of community behavior can also be found on a regular working day. Figs. 10e, 10f, and 10g show the pickup events on November 7 and 8, 2013 at different time intervals. The pickup events during working hours (Fig. 10e) show a close to uniform distribution within the area around Central Park. Fig. 10f shows high density at the Lincoln Center during the time interval 21:48-22:55, when a concert or other event may have ended. After midnight, we observe fewer pickup events, as it took six hours to accumulate $10^4$ events, with more pickups occurring around the Trump and Freedom towers.

Figs. 10d and 10h show the density estimated using 3D data (longitude, latitude and trip distance). The density function is colored in blue/red for low/high density regions. Fig. 10d shows that most of the trips during working hours are short trips for people moving within the island, whereas there is an increase in the number of long trips during the early morning as a result of the unavailability of public transportation (Fig. 10h shows a larger blue rendered volume than Fig. 10d at large Distance values). Note that these snapshots are provided as examples only, while KDE-Track provides an online visualization of the density function (footnote 6). Similar patterns of the density function are repeated over time with minor changes. Such patterns are not only useful in planning better services but also provide critical information to reduce social and environmental costs in transportation systems.

7 CONCLUSION

In this paper, we studied the problem of estimating the dynamic density that comes with evolving spatiotemporal data streams. We proposed the KDE-Track method to track the evolving distribution in a timely manner and to accurately estimate the probability density function of these data streams. The density function is available at any time point for visualizing and monitoring the data streams. The effectiveness and efficiency of KDE-Track have been analytically studied and experimentally demonstrated on both synthetic data streams and real-world New York Taxi trips data. In our future work, we will study the application of KDE-Track to data stream clustering. In [44], streaming data were mapped into a discretized density grid, which is similar to our resampling model but used for recording mapped characteristics. Clustering was performed off-line by connecting neighboring dense grids. We will investigate the use of the estimated density for clustering, since ‘dense’ areas are explicitly indicated by the estimated densities.

ACKNOWLEDGEMENTS

This work was supported by the King Abdullah University of Science and Technology.

4. All implementations were coded in C/C++ and run on an Intel 2.5 GHz dual-core PC with 4 GB of memory.

5. Available at: http://www.andresmh.com/nyctaxitrips/

6. Sample videos of the visualization are available at
* https://youtu.be/YvJZ2aeyLq4 (Global view of the 2-D density of taxi pickup events on Manhattan Island)
* https://youtu.be/jq37IRdBUI0 (More detailed view of the 2-D density around Central Park)
* https://youtu.be/d4n09DYz-o8 (Estimated density using 3-D NY Taxi data)

REFERENCES

[1] A. Zhou, Z. Cai, L. Wei, and W. Qian, “M-kernel merging: Towards density estimation over data streams,” in Proc. 8th Int. Conf. Database Syst. Adv. Appl., 2003, pp. 285–292.

[2] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, “Online outlier detection in sensor data using non-parametric models,” in Proc. 32nd Int. Conf. Very Large Data Bases, 2006, pp. 187–198.

[3] B. Schaller, “A regression model of the number of taxicabs in U.S. cities,” J. Public Transp., vol. 8, pp. 63–78, 2005.

[4] Z. Zhou and D. Matteson, “Predicting ambulance demand: A spatio-temporal kernel approach,” in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2015, pp. 2297–2303.

[5] F. Wu, Z. Li, W. Lee, H. Wang, and Z. Huang, “Semantic annotation of mobility data using social media,” in Proc. 24th Int. Conf. World Wide Web, 2015, pp. 1253–1263.

[6] A. P. Boedihardjo, C. Lu, and F. Chen, “A framework for estimating complex probability density structures in data streams,” in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2008, pp. 619–628.


[7] C. Heinz and B. Seeger, “Cluster kernels: Resource-aware kerneldensity estimators over streaming data,” IEEE Trans. Knowl. DataEng., vol. 20, no. 7, pp. 880–893, Jul. 2008.

[8] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visu-alization. Hoboken, NJ, USA: Wiley, 1992.

[9] Y. Cao,H.He, andH.Man, “SOMKE:Kernel density estimation overdata streams by sequences of self-organizing maps,” IEEE Trans.Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1254–1268, Aug. 2012.

[10] Y. Zheng, J. Jestes, J. Phillips, and F. Li, “Quality and efficiency inkernel density estimates for large data,” in Proc. ACM SIGMODInt. Conf. Manage. Data, 2013, pp. 433–444.

[11] C. Procopiuc and O. Procopiuc, “Density estimation for spatialdata streams,” in Proc. 9th Int. Conf. Advances Spatial TemporalDatabases, 2005, pp. 109–126.

[12] M. C. Jones, “Discretized and interpolated kernel densityestimates,” J. Amer. Statist. Assoc., vol. 84, pp. 733–741, 1989.

[13] M. Wand, “Fast computation of multivariate kernel estimators,” J.Comput. Graph. Statist., vol. 3, pp. 433–445, 1994.

[14] B. Silverman, Density Estimation for Statistics and Data Analysis.London, U.K.: Chapman and Hall, 1986.

[15] A. Gary and A. Moore, “Nonparametric density estimation:Toward computational tractability,” in Proc. SIAM Int. Conf. DataMining, 2003, pp. 203–211.

[16] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom,“Models and issues in data stream systems,” in Proc. 21st ACMSIGMOD-SIGACT-SIGART Symp. Principles Database Syst., 2002,pp. 1–16.

[17] X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag,“Data stream clustering with affinity propagation,” IEEE Trans.Knowl. Data Eng., vol. 26, no. 7, pp. 1644–1656, Jul. 2014.

[18] A. Kogure, “Effective interpolations for kernel densityestimators,” J. Nonparametric Statist., vol. 9, pp. 165–195, 1998.

[19] C. Lin, J. Wu, and C. Yen, “A note on kernel polygons,” Biometrika,vol. 93, pp. 228–234, 2006.

[20] J. Fan and J. S. Marron, “Fast implementations of nonparametriccurve estimators,” J. Comput. Graph. Statist., vol. 3, pp. 35–56, 1994.

[21] T. Hart and P. Zandbergen, “Kernel density estimation and hot-spot mapping: Examining the influence of interpolation method,grid cell size, and bandwidth on crime forecasting,” Policing: Int.J. Police Strategies Manage., vol. 37, no. 2, pp. 305–323, 2014.

[22] V. A. Epanechnikov, “Non-parametric estimation of a multivari-ate probability density,” Theory Probability Appl., vol. 14, pp. 153–158, 1969.

[23] B. Turlach, “Bandwidth selection in kernel density estimation: Areview,” CORE Institut de Statistique, vol. 19, no. 4, pp. 1–33, 1993.

[24] J. D. Habbema, J. Hermans, and K. van den Broek, “A stepwisediscrimination analysis program using density estimation,” inProc. Comput. Statist., 1974, pp. 101–110.

[25] R. Duin, “On the choice of smoothing parameters for Parzen esti-mators of probability density functions,” IEEE Trans. Comput.,vol. C-25, no. 11, pp. 1175–1179, Nov. 1976.

[26] M. Rudemo, “Empirical choice of histograms and kernel densityestimators,” Scandinavian J. Statist., vol. 9, pp. 65–78, 1982.

[27] A. Bowman, “An alternative method of cross-validation for thesmoothing of density estimates,” Biometrika, vol. 71, pp. 353–360,1984.

[28] D. Scott and G. Terrell, “Biased and unbiased cross-validation indensity estimation,” J. Amer. Statist. Assoc., vol. 82, pp. 1131–1146,1987.

[29] P. Hall, S. Sheather, M. Jones, and J. Marron, “On optimal data-based bandwidth selection in kernel density estimation,” Biome-trika, vol. 78, pp. 263–269, 1992.

[30] M. Woodroofe, “On choosing a delta-sequence,” Ann. Math. Stat-ist., vol. 41, pp. 1665–1671, 1970.

[31] D. W. Scott, R. A. Tapia, and J. R. Thompson, “Kernel density estimation revisited,” Nonlinear Anal.: Theory Methods Appl., vol. 1, pp. 339–372, 1977.

[32] P. Hall and J. Marron, “Estimation of integrated squared density derivatives,” Statist. Probability Lett., vol. 6, pp. 109–115, 1987.

[33] M. Jones, “The roles of ISE and MISE in density estimation,” Statist. Probability Lett., vol. 12, pp. 51–56, 1991.

[34] H. Shimazaki and S. Shinomoto, “Kernel bandwidth optimization in spike rate estimation,” J. Comput. Neuroscience, vol. 29, pp. 171–182, 2010.

[35] Y. Zheng and J. Phillips, “l1 error and bandwidth selection for kernel density estimates of large data,” in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2015, pp. 1533–1542.

[36] F. Oyegue, S. Ogbonmwan, and V. Ekhosuehi, “On the efficiency of second-order d-dimensional product kernels,” in Transactions on Engineering Technologies, 1st ed., H. Kim, M. Amouzegar, and S. Ao, Eds. Dordrecht, The Netherlands: Springer, 2015, pp. 31–40.

[37] M. Lichman and P. Smyth, “Modeling human location data with mixtures of kernel densities,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 35–44.

[38] A. Qahtan, X. Zhang, and S. Wang, “Efficient estimation of dynamic density functions with an application to outlier detection,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 2159–2163.

[39] T. Chan, G. Golub, and R. LeVeque, “Algorithms for computing the sample variance: Analysis and recommendations,” Amer. Statistician, vol. 37, pp. 242–247, 1983.

[40] R. Sain, “Multivariate locally adaptive density estimation,” Comput. Statist. Data Anal., vol. 39, pp. 165–186, 2002.

[41] J. Marron and M. Wand, “Exact mean integrated squared error,” Ann. Statist., vol. 20, pp. 712–736, 1992.

[42] O. D. Lampe and H. Hauser, “Interactive visualization of streaming data with kernel density estimation,” in Proc. IEEE Pacific Vis. Symp., 2011, pp. 171–178.

[43] C. C. Aggarwal, “A framework for diagnosing changes in evolving data streams,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2003, pp. 575–586.

[44] Y. Chen and L. Tu, “Density-based clustering for real-time stream data,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 133–142.

Abdulhakim Qahtan received the BSc degree in computer science from Cairo University, Egypt, in 2000, the MSc degree in information and computer science from the King Fahd University of Petroleum and Minerals, Saudi Arabia, in 2008, and the PhD degree in computer science from the King Abdullah University of Science & Technology (KAUST), Saudi Arabia, in 2016. His main research interests include data stream mining, machine learning, and statistics.

Suojin Wang received the PhD degree from the University of Texas at Austin. He is a professor of statistics and epidemiology and biostatistics with Texas A&M University. His research interests include semi- and non-parametric statistical methodology, and biostatistical applications. He was the editor-in-chief of the Journal of Nonparametric Statistics during 2007-2012. He is an elected fellow of the American Statistical Association and the Institute of Mathematical Statistics. He is an elected member of the International Statistical Institute.

Xiangliang Zhang received the PhD degree in computer science from INRIA-Université Paris-Sud, France, in 2010. She is currently an assistant professor and directs the Machine Intelligence and Knowledge Engineering (MINE) Laboratory, King Abdullah University of Science and Technology (KAUST), Saudi Arabia. She was a European ERCIM research fellow with the Norwegian University of Science and Technology, Norway, in 2010. Her main research interests and experiences include machine learning, data mining, and cloud computing.

