
Information Sciences 345 (2016) 271–293


A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data

Jin-Yin Chen∗, Hui-Hao He

College of Information Engineering, Zhejiang University of Technology, Hangzhou, China

Article info

Article history:

Received 20 March 2015

Revised 18 November 2015

Accepted 27 January 2016

Available online 3 February 2016

Keywords:

Data mining

Mixed attributes

Data stream clustering

Peak field intensity

Mixed distance measure metrics

Abstract

Most data streams encountered in real life consist of data objects with mixed numerical and categorical attributes. Most current data stream algorithms suffer from shortcomings including low clustering quality, difficulty in determining cluster centers, and poor handling of outliers. A fast density-based data stream clustering algorithm whose cluster centers are determined automatically in the initialization stage is proposed. Based on an analysis of the relationships among data attributes, mixed data sets are classified into three types, and a corresponding distance measure metric is designed for each type. Based on the field intensity-distance distribution graph of the data objects, a linear regression model and residual analysis are used to find the outliers of the graph, enabling automatic determination of cluster centers. Once the cluster centers are found, all data objects are clustered according to their distances to the centers. The data stream clustering algorithm adopts an online/offline two-stage processing framework and a new micro cluster characteristic vector to maintain the arriving data objects dynamically. A micro cluster decay function and a micro cluster deletion mechanism are used to maintain the micro clusters, so that the evolution of the data stream is reflected accurately. Finally, the performance of the proposed algorithm is verified by a series of experiments on real-world mixed data sets, in comparison with several outstanding clustering algorithms, in terms of clustering purity, efficiency and time complexity.

© 2016 Elsevier Inc. All rights reserved.

1. Introduction

As one of the most important techniques in data mining, clustering partitions a set of unlabeled objects into clusters such that objects in the same cluster are more similar to each other than to objects in other clusters [18]. Clustering algorithms have been developed and applied to various fields including text analysis, customer segmentation, gene engineering, etc. They are also useful in daily life, since massive data with mixed attributes are now emerging. Typically these data contain both numeric and categorical attributes [11]. For example, a credit card application involves age (integer), income (float), marital status (categorical), etc., forming a typical example of data with mixed attributes. Up to now, most research on data clustering has focused on either numeric or categorical data rather than both. Examples include BIRCH [24], k-modes [16], fuzzy K-modes [17], BFCM [21], TCGA [14], fuzzy k-modes [5] and a k-means based method [22]. Methods such as [4,8,12,13,15,19,20,23,27,30] face problems when clustering data streams with mixed attributes, since the data in a stream arrive very quickly.

∗ Corresponding author. Tel.: +086 13666611145.
E-mail address: [email protected], [email protected] (J.-Y. Chen).

http://dx.doi.org/10.1016/j.ins.2016.01.071

0020-0255/© 2016 Elsevier Inc. All rights reserved.


A distance measure metric designed for numerical values only cannot capture the distance among data with mixed attributes. Also, the representative of a cluster with numerical values is often defined as the mean of the cluster, which is not possible for other attribute types. To deal with this problem, algorithms [8,15,20,23,27] have been proposed, most of which are partition based. These algorithms first obtain a set of disjoint clusters and then refine them to minimize a predefined criterion function. The objective is to maximize intra-cluster connectivity or compactness while minimizing inter-cluster connectivity. However, most partition clustering algorithms are sensitive to the initial number of clusters, which is difficult to determine without prior knowledge. They are also better suited to spherically distributed data and cannot handle outliers.

Inspired by density-based data clustering, we propose a novel self-adaptive peak density clustering algorithm for data with mixed attributes (ACC-FSFDP). An efficient distance evaluation method is designed based on data types determined in advance. Mixed data sets are classified into three types, namely numeric dominant data, categorical dominant data and balanced data, and corresponding distance measure metrics are designed. Based on a theoretical analysis of the data field intensity of cluster centers and data objects, field intensity peaks represent the cluster centers, which always have relatively large distance from objects with higher field intensity. After the cluster centers have been found, each remaining data object is assigned to the same cluster as its nearest neighbor of higher field intensity, and the final clustering result is then obtained precisely. We also propose an algorithm that extends ACC-FSFDP to data streams, called Str-FSFDP.

The main contributions of this work are:

1. A novel distance metric for mixed data streams is designed according to the relationships among data attributes. The ACC-FSFDP algorithm is adopted for initialization, in which clustering centers are determined automatically through two steps: (a) regression analysis techniques are applied to fit the relationship between field intensity and distance for every data object, and (b) residual analysis is used to determine the centers.

2. In Str-FSFDP, a new micro cluster characteristic vector is introduced to maintain the arriving mixed data objects dynamically. A frequency histogram is adopted to record the categorical attributes, and the mean value of the numerical attributes together with the maximum-frequency categorical values is used to represent the center of a micro cluster.

3. An online/offline framework is adopted for Str-FSFDP, in which a micro cluster decay function and a micro cluster deletion mechanism are applied to maintain the micro clusters in the online stage, making Str-FSFDP more consistent with the intrinsic characteristics of the original mixed data stream. An improved density-based method is adopted in the offline stage, in which each micro cluster is treated as a virtual object and the micro clusters are clustered to obtain the final clustering result.

The rest of this paper is organized as follows. Section 2 introduces background information on data stream clustering. Section 3 describes ACC-FSFDP and Str-FSFDP in detail. Section 4 presents the simulations and compares ACC-FSFDP and Str-FSFDP with other outstanding data clustering algorithms. Finally, Section 5 concludes the paper.

2. Related works

In order to cluster data with mixed attributes, Huang proposed k-prototypes [15], which combines the k-means and k-modes algorithms. Considering the uncertain character of data, KL-FCM-GM [23] extends the k-prototypes algorithm; it is an extension of Gath-Geva designed for Gauss-Multinomial distributed data. EKP [30] is developed within an evolutionary algorithm framework to improve the global search capability of k-prototypes. The distance-based agglomerative clustering algorithm SBAC [8] adopts the distance measure defined by Goodall. CAVE [11] is designed for clustering mixed data based on variance and entropy; however, CAVE needs to build a distance hierarchy for each categorical attribute, and determining the distance hierarchy requires domain expertise. Another k-means type algorithm [4] deals with mixed data by using the co-occurrence of categorical values to calculate the significance of attributes and the distance between categorical values. IWKM [19] combines the mean values of all distribution centroids to represent the cluster prototypes and takes into account the significance of each attribute in the clustering process. WFK-prototypes [20] combines the mean values of fuzzy centroids to represent the cluster prototypes and adopts the significance concepts proposed in [4] to extend k-prototypes. SpectralCAT [13] is proposed for clustering numerical and nominal data. The paper [27] proposed a mixed data clustering algorithm based on a unified distance metric that does not require the cluster number to be known in advance; embedded competition and penalization mechanisms determine the number of clusters automatically by gradually eliminating redundant clusters. Other clustering algorithms [26] combine field theory with traditional clustering. A hierarchical clustering method [26] based on data fields adopts the potential function to describe the effect relationship between data objects. FTSC [27] is a field-theory based spatial clustering method in which a novel concept of aggregation force is used to measure the degree of aggregation among data objects. Alex and Alessandro [1] proposed a new approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from objects with higher densities. However, their approach requires human supervision to determine the cluster centers, the clustering quality is sensitive to the cutoff distance parameter, and the algorithm cannot deal with mixed data.

A large number of approaches have also been proposed for data stream clustering. CluStream [2] was the first algorithm designed for clustering evolving data streams through an online/offline two-stage framework. A micro-cluster structure is defined in the online stage, which maintains the arriving data objects continuously and generates online summary information, while the offline stage is responsible for answering users' requests. CluStream has received much attention because of its flexibility and scalability. However, CluStream is sensitive to outliers and cannot discover clusters of arbitrary shape. Moreover, the precise cluster number has to be preset, which seriously distorts how the original distribution of the data stream is represented. Aiming at high-dimensional data stream clustering, HPStream [3] introduces projection and a decay function to improve the performance of CluStream. Based on the online/offline two-stage framework, Den-Stream [7] consists of potential micro-clusters and outlier micro-clusters and can cluster data streams of arbitrary shape, but it adopts globally consistent parameters that are sensitive to different applications. In D-Stream [10], the data space is partitioned into discretized fine grids and new data objects are mapped into the corresponding grids; clustering operations are carried out on the grids instead of the raw data stream to achieve better clustering performance. Since each data object is mapped to a grid, precise position information is lost, resulting in an inability to cluster edge grids. D-StreamII [25] is an extended version of D-Stream that handles edge grids through grid attraction at the cost of more computation time. The AP algorithm was extended to cluster data streams as StrAP [29], presented at PKDD. If a newly arrived data object matches the current model, the micro clusters are updated; otherwise the data object is recorded as an outlier. The clustering results are sensitive to the size of the sliding window, and the time influence of arriving data is not considered. In order to solve these problems, StrDenAP [28] was proposed based on StrAP; it adopts micro cluster decay density and a micro cluster deletion mechanism to guarantee better clustering results. EXCC [6] is an exclusive and complete clustering algorithm for heterogeneous data streams. Dynamic thresholds and a wait-and-watch policy are adopted to detect noise at the cost of more memory and processing time.

3. Str-FSFDP for mixed data stream clustering

3.1. The framework of Str-FSFDP

The proposed Str-FSFDP consists of initialization, online maintenance and offline clustering. In the initialization stage, ACC-FSFDP is applied to cluster the initial N data objects and establish micro clusters. The micro clusters are updated dynamically during the online stage, and the decay function and micro cluster deletion mechanism are used to track the evolving data stream in real time. When a clustering request arrives in the offline stage, the improved density-based method is carried out to obtain the final clustering result. On the basis of this cooperation between initialization, online and offline clustering, mixed attribute data streams can be clustered quickly and accurately. The framework of Str-FSFDP is shown in Fig. 1.

Fig. 1. Data stream clustering mode of Str-FSFDP.


Table 1
Real data sets from UCI.

Data set        Attributes   r    q    k    Number of instances
Aggregation     2            2    0    7    788
Spiral          2            2    0    3    312
Jain            2            2    0    2    373
Flame           2            2    0    2    240
Iris            4            4    0    3    150
Breast          9            9    0    2    699
Soybean         35           0    35   4    47
Zoo             15           1    14   7    101
Acute           7            1    6    2    120
Statlog Heart   13           5    8    2    270
Credit          15           9    6    2    690
KDDCUP-99       41           34   7    /    1000

r is the number of numerical attributes, q is the number of categorical attributes, "/" stands for an unknown parameter, and k is the number of clusters.

3.2. Distance metrics for mixed data

3.2.1. Main idea

A distance metric is a prerequisite for meaningful cluster analysis. The distance measure metrics of classical partition-based clustering algorithms are listed as follows.

(1) k-means can only deal with numerical data sets, using d(X_i, X_j) = √(∑_{p=1}^{m} (X_i^p − X_j^p)^2), where d(X_i^p, X_j^p) = (X_i^p − X_j^p)^2.

(2) k-modes [16] can only cluster categorical data sets, using d(X_i, X_j) = ∑_{p=1}^{m} δ(X_i^p, X_j^p), where δ(X_i^p, X_j^p) = 0 if X_i^p = X_j^p and 1 if X_i^p ≠ X_j^p.

(3) k-prototypes [15] can handle mixed data sets with the distance metric d(X_i, Q_l) = ∑_{j=1}^{p} (X_{ij}^r − q_{ij}^r)^2 + μ_l ∑_{j=p+1}^{m} δ(X_{ij}^c, q_{ij}^c), where the numerical part is d(X_i, Q_l) = (X_i^p − Q_j^p)^2 and δ(X_i^p, Q_j^p) = 0 if X_i^p = Q_j^p and 1 if X_i^p ≠ Q_j^p.

(4) EKP [30] can also handle mixed data sets, using d(X_i, Q_l) = ∑_{j=1}^{p} (X_{ij}^r − q_{ij}^r)^2 + r ∑_{j=p+1}^{m} δ(X_{ij}^c, q_{ij}^c), where the numerical attributes are calculated by d(X_i, Q_l) = (X_i^p − Q_j^p)^2 and the categorical attributes by δ(X_i^p, Q_j^p) = 0 if X_i^p = Q_j^p and 1 if X_i^p ≠ Q_j^p.

(5) WFK-prototypes [20] can cluster mixed data sets, using d(X_i, Q_l) = ∑_{l=1}^{p} s_l (X_{ij}^r − q_{ij}^r)^2 + ∑_{l=p+1}^{m} φ(X_{ij}^c, v_{ij}^c)^2, where the numerical part is d(X_i, Q_l) = s_l (X_i^p − Q_j^p)^2 and the categorical part is φ(X_i^p, Q_j^p)^2.

The Euclidean distance is adopted by the k-means algorithm to handle pure numerical data. The simple matching distance is adopted by the k-modes algorithm to deal with pure categorical data. k-prototypes integrates k-means and k-modes to deal with mixed data. EKP and WFK-prototypes improve k-prototypes by introducing a fuzzy factor and weight coefficients into the original distance measure, respectively, both of which capture the similarities among objects more accurately. The algorithms listed above adopt a single distance metric for all mixed data sets; however, the importance of each attribute for the final clustering result differs from data set to data set.

For example, each object in the Zoo data set is described by 1 numerical attribute and 15 categorical attributes, such as legs (numerical), hair (categorical), feathers (categorical), etc. Take deer {1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,4} and dolphin {0,0,0,1,0,1,1,1,1,1,0,1,1,0,1,0}, which belong to cluster 1, as an example; their categorical attributes are {1,0,0,1,0,0,0,1,1,1,0,0,1,0,1} and {0,0,0,1,0,1,1,1,1,1,0,1,1,0,1}, and their numerical attributes are {4} and {0}, respectively. Another object, frog {0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,4}, belongs to cluster 5; its categorical attributes are {0,0,1,0,0,1,1,1,1,1,1,0,0,0,0} and its numerical attribute is {4}. Suppose k-prototypes or EKP is used to compute the distance between deer and dolphin, d(deer, dolphin), and the distance between deer and frog, d(deer, frog). The categorical and numerical parts of d(deer, dolphin) are 4 and 16, so the total distance is 20. The categorical and numerical parts of d(deer, frog) are 8 and 0, so the total distance is 8. The distance between deer and frog is smaller than the distance between deer and dolphin, so deer appears more similar to frog than to dolphin. But in fact, deer and dolphin belong to the same cluster "1", which is different from the cluster of frog. Therefore the distance measures of k-prototypes and EKP are not suitable for this type of mixed data set.
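To make the issue concrete, the following minimal sketch (ours, not the authors' code; the unit categorical weight μ = 1 is an assumption) computes a k-prototypes-style distance for the three Zoo objects above and reproduces the 20 versus 8 comparison.

```python
# Minimal sketch of a k-prototypes-style distance (assumed categorical weight mu = 1).
def kprototypes_distance(x_num, y_num, x_cat, y_cat, mu=1.0):
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))   # squared Euclidean part
    cat_part = sum(1 for a, b in zip(x_cat, y_cat) if a != b)    # simple matching part
    return num_part + mu * cat_part

deer    = ([4], [1,0,0,1,0,0,0,1,1,1,0,0,1,0,1])
dolphin = ([0], [0,0,0,1,0,1,1,1,1,1,0,1,1,0,1])
frog    = ([4], [0,0,1,0,0,1,1,1,1,1,1,0,0,0,0])

print(kprototypes_distance(deer[0], dolphin[0], deer[1], dolphin[1]))  # 20.0 (16 + 4)
print(kprototypes_distance(deer[0], frog[0], deer[1], frog[1]))        # 8.0  (0 + 8)
```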

In order to define a suitable distance measure metric for all types of mixed data sets, a dominant-factor based dominance analysis method is proposed, which classifies mixed data sets into three types: numeric dominant data, categorical dominant data and balanced data. An appropriate distance metric is then designed for each type.


3.2.2. Dominance analysis

Let D = {X_1, X_2, ..., X_i, ..., X_n} denote a set of n data objects to be clustered. Each X_i is represented as X_i = {X_i^1, X_i^2, ..., X_i^d}, where d is the number of attributes. The d attributes consist of r numeric attributes and q categorical attributes. We define the dominant factor α and compare it with r/d or q/d to categorize the data.

Case 1. If r/d > α, the data set D is numeric dominant.

Case 2. If q/d > α, the data set D is categorical dominant.

Case 3. If 1 − α < r/d < α or 1 − α < q/d < α, the data set D is balanced.

We have tested many typical mixed data sets from the UCI machine learning repository and conclude that the dominant factor α = 0.75 [30].
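As a minimal sketch of this classification rule (the function and example calls are ours, assuming α = 0.75 as above):

```python
def dominance_type(r, q, alpha=0.75):
    """Classify a mixed data set by its attribute counts (r numeric, q categorical)."""
    d = r + q
    if r / d > alpha:
        return "numeric dominant"
    if q / d > alpha:
        return "categorical dominant"
    return "balanced"

# Examples from Table 1: Zoo (r=1, q=14) and Statlog Heart (r=5, q=8).
print(dominance_type(1, 14))   # categorical dominant
print(dominance_type(5, 8))    # balanced
```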

3.2.3. Mixed data distance metric

Taking the data set D as an example, a novel distance measure metric for mixed data is designed on the basis of the dominance analysis, where d(X_i, X_j)_n and d(X_i, X_j)_c represent the distances over the numerical attributes and the categorical attributes, respectively. The motivation of the dominance analysis is to classify mixed attribute data sets into three types based on an analysis of their characteristics. For numeric dominant data or categorical dominant data, the distance metric between two data objects is designed to amplify the effect of the dominant attributes while reducing the effect of the non-dominant attributes on the overall clustering accuracy. For balanced data, the significance of each attribute is evaluated.

Consider a data set that is numeric dominant (or categorical dominant). Its categorical attributes (or numerical attributes) are in the minority, but these non-dominant attributes may still be very important for the clustering result. For such special situations, we adopt a semi-supervised clustering method based on labeled training data to obtain the weight of each attribute. The semi-supervised clustering method is described as follows. Given the data set D = {X_1, X_2, ..., X_i, ..., X_n}, where each data object X_i has d dimensions, X_i = {X_i^1, X_i^2, ..., X_i^d}, with r numeric attributes and q categorical attributes, the weight vector ω = {ω_r^1, ω_r^2, ..., ω_r^r, ω_q^1, ω_q^2, ..., ω_q^q} is adopted to describe the significance of each attribute; its initial values are generated randomly from 0 to 1. The mean distance between data objects and the cluster center within the same cluster is used as the performance evaluation criterion. PSO is adopted to update the weight vector until the performance evaluation remains unchanged. The resulting optimal weight vector ω = {ω_r^1, ω_r^2, ..., ω_r^r, ω_q^1, ω_q^2, ..., ω_q^q} is assigned as the weights of the attributes. For most mixed data sets, both numerical and categorical attributes are balanced by the dominance analysis, so the weight vector is ω = {1, 1, ..., 1}.

3.2.3.1. Correlation analysis between properties of mixed data. Most traditional distance metrics treat each attribute independently. In fact, however, attributes are associated with each other through underlying internal structures. For the given data set, let A_i be a categorical attribute whose values include x and y, and let A_j be another categorical attribute. Let z denote a subset of the values of A_j, and z^c the complementary set of z. Given a data object whose attribute A_i has value x, p_i(z/x) denotes the conditional probability that attribute A_j takes a value in z. Similarly, given a data object whose attribute A_i has value y, p_i(z^c/y) denotes the conditional probability that attribute A_j takes a value in z^c.

Definition 1. The distance between attribute values x and y of A_i with respect to attribute A_j, Max d_{ij}(x, y), is defined as

Max d_{ij}(x, y) = p_i(z/x) + p_i(z^c/y)    (1)

where z is the subset of A_j's values that maximizes the quantity p_i(z/x) + p_i(z^c/y). Since both p_i(z/x) and p_i(z^c/y) lie between 0 and 1, to restrict the value of Max d_{ij}(x, y) to the interval [0, 1] we redefine Max d_{ij}(x, y) as

Max d_{ij}(x, y) = p_i(z/x) + p_i(z^c/y) − 1.0    (2)

Eq. (2) expresses the distance between values x and y of A_i as a function of their co-occurrence probabilities with a set of values of another categorical attribute A_j. For the other categorical attributes, a similar metric for the pair of values x and y can be computed with respect to each of those attributes. The absolute distance between the pair of values x and y is then computed as the average of all these values. The distance between x and y with respect to a numeric attribute is calculated by discretizing the numeric attribute.

Definition 2. For a data set of d attributes, inclusive of categorical attributes and numeric attributes that have been discretized, the distance between two distinct values x and y of any categorical attribute A_i is given by

d_i(x, y) = (∑_{j=1, i≠j}^{d} d_{ij}(x, y)) / (d − 1)    (3)

where d_i(x, y) satisfies

(1) 0 ≤ d_i(x, y) ≤ 1.
(2) d_i(x, y) = d_i(y, x).
(3) d_i(x, x) = 0.


The significance of an attribute defines how much that attribute contributes to the whole data set. Determining the significance of an attribute is, however, task-dependent. Attributes displaying a good separation of co-occurring values into different groups play a more significant role in data clustering. In other words, an attribute playing a significant role in clustering is one whose pairs of attribute values are well separated with respect to all other attributes, i.e., have an overall high value of d_{ij}(x, y) for all pairs x and y. Since it considers the grouping of different categorical values, the distance between two categorical values implies the significance of the corresponding attribute. To compute the significance of numeric attributes, each numeric attribute is first normalized and discretized, with the same number of intervals T chosen for all numeric attributes. Each interval is assigned a categorical value u[1], u[2], ..., u[T]. Thereafter, for a discretized numeric attribute, the distance between every pair of categorical values d(u[r], u[s]) is computed in the same way as for categorical values. To handle mixed data, the significance of each attribute is analyzed for clustering. The distance measure metric for mixed data is defined as follows.

Definition 3. Given the data set D = {X_1, X_2, ..., X_i, ..., X_n}, where each data object X_i has d dimensions, X_i = {A_i^1, A_i^2, ..., A_i^r, B_i^1, B_i^2, ..., B_i^q}, including r numeric attributes and q categorical attributes with d = r + q, the distance D(X_i, X_j) between two data objects is defined as

D(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c    (4)

where d(X_i, X_j)_n denotes the numerical-part distance between objects X_i and X_j, and d(X_i, X_j)_c denotes the categorical-part distance. The final distance is the sum of the two parts. According to the analysis in Section 3.2.2, mixed data with different dominant types have different correlations between properties. Therefore, we define three distance measure metrics according to the dominance situation.

3.2.3.2. Distance measure metric for numeric dominant data. According to the dominance analysis in Section 3.2.2, the attribute numbers of a numeric dominant data set satisfy r > q. As seen in formula (4), for each data object in data set D, the distance over any attribute d_i(x, y) is composed of ∑_{j=1, i≠j}^{r} d_{ij}(x, y) and ∑_{j=1, i≠j}^{q} d_{ij}(x, y). Since r > q, the numerical attributes occupy the dominant position in calculating the overall distance. In order to reduce the influence of the non-dominant attributes on the overall clustering accuracy and to improve the calculation speed, the distances d(X_i, X_j)_n and d(X_i, X_j)_c are defined as follows.

Definition 4. For any two data objects X_i, X_j, the distance of the numerical part is

d(X_i, X_j)_n = ∑_{p=1}^{r} (X_i^p − X_j^p)^2    (5)

Definition 5. For each categorical attribute of objects X_i, X_j, the distance of the p-th attribute between X_i and X_j is

d(X_i^p, X_j^p)_c = 0 if X_i^p = X_j^p, and 1 if X_i^p ≠ X_j^p    (6)

and the distance of the categorical part is

d(X_i, X_j)_c = ∑_{p=1}^{q} d(X_i^p, X_j^p)    (7)

3.2.3.3. Distance measure metric for categorical dominant data. According to the dominance analysis in Section 3.2.2, the attribute numbers of a categorical dominant data set satisfy q > r. As in formula (4), for each data object in data set D, d_i(x, y) is composed of ∑_{j=1, i≠j}^{r} d_{ij}(x, y) and ∑_{j=1, i≠j}^{q} d_{ij}(x, y). Since q > r, the categorical attributes occupy the dominant position in calculating the overall distance. Therefore, for each numerical attribute of a data object X_i, a standardized (min-max normalized) value is adopted. The normalized value of the p-th numerical attribute of X_i is

d(X_i^p)_n = (X_i^p − X_{i_min}^p) / (X_{i_max}^p − X_{i_min}^p)    (8)

where X_{i_max}^p and X_{i_min}^p represent the maximum and minimum values of the p-th attribute, respectively.

Definition 6. The distance of the numerical part between any two objects X_i, X_j is

d(X_i, X_j)_n = ∑_{p=1}^{r} (d(X_i^p)_n − d(X_j^p)_n)    (9)

For categorical dominant data, the distance measure metric for the categorical attributes is the same as in Definition 5.


Fig. 2. The distribution of field intensity and distance in the data field for one single data object.

3.2.3.4. Distance measure metric for balanced data. According to the dominance analysis in Section 3.2.2, the attribute numbers of a balanced data set satisfy q ≈ r. For each data object in data set D, the distance between two data objects is the sum of the distances over all attributes d_i(x, y).

Definition 7. For balanced data, the distance between objects X_i and X_j is defined as

D(X_i, X_j) = ∑_{p=1}^{d} d_p(X_i, X_j)    (10)

where d_p(X_i, X_j) denotes the distance of the p-th attribute between X_i and X_j.
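The following sketch (our own illustration, not the paper's code) wires Definitions 4-7 together behind the dominance analysis. The attribute ordering (numeric attributes first), the function names, and the simplified balanced branch (plain per-attribute differences rather than the significance-weighted attribute distances of Definitions 1-3) are assumptions.

```python
# Sketch of the dominance-aware mixed distance (Eqs. (4)-(10)); names and layout assumed.
def mixed_distance(x_num, y_num, x_cat, y_cat, dom_type, num_ranges=None):
    """x_num/y_num: numeric attribute lists; x_cat/y_cat: categorical attribute lists.
    dom_type: 'numeric dominant', 'categorical dominant' or 'balanced'.
    num_ranges: list of (min, max) per numeric attribute, needed for Eq. (8)."""
    cat_part = sum(1 for a, b in zip(x_cat, y_cat) if a != b)            # Eqs. (6)-(7)
    if dom_type == "numeric dominant":
        num_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))       # Eq. (5)
    elif dom_type == "categorical dominant":
        # Eqs. (8)-(9), kept literal (signed normalized differences) as reconstructed above.
        num_part = sum(((a - lo) / (hi - lo)) - ((b - lo) / (hi - lo))
                       for (a, b), (lo, hi) in zip(zip(x_num, y_num), num_ranges))
    else:
        # Balanced data, Eq. (10); simplified here to plain per-attribute absolute differences.
        num_part = sum(abs(a - b) for a, b in zip(x_num, y_num))
    return num_part + cat_part                                           # Eq. (4)
```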

3.3. Initialization stage for Str-FSFDP

3.3.1. Data field theory

Data field theory is inspired by the concept of a field in physics; it introduces the interaction of material particles and the field description method into the abstract data space. This theory overcomes drawbacks of traditional clustering algorithms that consider only one-to-one effect relationships between objects. The theory adopts a potential function to describe the many-to-one effect relationship between objects, and assumes that the state of any object in the data space is the result of the joint action of all other objects. The potential function of the data field is defined as follows.

Definition 8. Given the data set D = {x_1, x_2, ..., x_i, ..., x_n} in the data space Ω ⊆ R^d, for any object x ∈ D, the field intensity of x is

φ(x) = ∑_{i=1}^{n} m_i e^{−(‖x − x_i‖ / σ)^2}    (11)

where m_i ≥ 0 is the mass of object x_i (i = 1, 2, ..., n), which represents the degree of influence of object x_i on other objects. Under normal circumstances the impact of each object is equal, so m_i = 1. ‖x − x_i‖ denotes the distance between objects x and x_i, and the distance measure metric for mixed data objects is given in Section 3.2.3. σ ∈ (0, +∞), called the impact factor, controls the interaction between objects; the smaller the impact factor, the smaller the influence range of a single object.

In order to analyze the relationship between field intensity and distance for a single object, set the mass m_i = 1. The relationship shown in Fig. 2 indicates that when the distance equals 0.705σ the field intensity reaches its maximum, which illustrates that on the spherical surface of radius 0.705σ around a data object there exists a strong force pointing towards the object. When the distance to the data object reaches R ≈ 2.121σ, the field intensity fades to 0, indicating a short-range field. Therefore, any data object in the data space is mainly affected by the other objects within distance R ≤ 2.121σ. In order to reduce the computational cost, we redefine the potential function as follows.

Definition 9. Given the data set D = {x_1, x_2, ..., x_i, ..., x_n} in the data space Ω ⊆ R^d, for any object x ∈ D, the field intensity of x is

φ(x) = ∑_{i=1}^{p} m_i e^{−(‖x − x_i‖ / σ)^2}    (12)

where p is the number of data objects within radius R ≈ 2.121σ of object x.
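A small sketch of Eq. (12) (ours; the distance function is passed in as a parameter, and the 2.121σ cutoff follows the text):

```python
import math

def field_intensity(x, data, sigma, dist):
    """Field intensity of object x per Eq. (12): only neighbors within 2.121*sigma contribute."""
    cutoff = 2.121 * sigma
    return sum(math.exp(-(dist(x, xi) / sigma) ** 2)
               for xi in data if dist(x, xi) <= cutoff)
```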


Fig. 3. Relationship of original distribution and φ–δ for a data set of two dimensions. (For interpretation of the references to color in the text, the reader is referred to the web version of this article.)

3.3.2. Main idea of ACC-FSFDP

ACC-FSFDP is based on the assumption that cluster centers are surrounded by neighbors with lower field intensity while being at a relatively large distance from objects with higher field intensity [9]. Noise objects have a comparatively large distance and small intensity. Therefore, for a given data object i, only the field intensity and the distance need to be calculated. The field intensity φ_i can be calculated according to formula (12), and δ_i is the minimum distance between data object i and any other object with higher field intensity:

δ_i = min_{j: φ_j > φ_i} (d_{ij})    (13)

For the object with the highest field intensity, we conventionally take δ_i = max_j (d_{ij}). Note that δ_i is much larger than the typical nearest-neighbor distance only for objects that are local or global maxima of the field intensity. Thus cluster centers are recognized as objects with anomalously large δ_i. This core observation can be illustrated by the simple example in Fig. 3. The original data distribution is shown in Fig. 3(a), while the relationship between field intensity φ and distance δ is shown in Fig. 3(b); specific correspondences are indicated by lines. The three red objects A1, A2, A3 in Fig. 3(a) are cluster centers, and the corresponding objects A1, A2, A3 in Fig. 3(b) have relatively larger distance and larger field intensity than the other objects. In addition, the three black objects B1, B2 and B3 in Fig. 3(a) are isolated and are defined as noise objects; the corresponding objects B1, B2, B3 in Fig. 3(b) have relatively large distance and smaller field intensity than the other objects. The other objects, called border objects, each belong to one cluster.

For a given data set, the field intensity φ and distance δ are calculated for every object, and all data objects are sorted in ascending order of intensity from φ_1 to φ_a, where φ_1 and φ_a represent the minimum and maximum of all φ; the distances δ are ordered likewise. For each data object i, the following qualitative relationships hold:

Case 1. If φ_i ∈ (φ_b, φ_n) and δ_i ∈ (δ_b, δ_n), then data object i is a cluster center.

Case 2. If φ_i ∈ (φ_1, φ_a) and δ_i ∈ (δ_b, δ_n), then data object i is a noise object.

Case 3. If data object i satisfies neither Case 1 nor Case 2, then it is a border object.

In summary, only data objects with relatively large φ and large δ can be cluster centers, and isolated objects with relatively large δ but small φ are defined as noise objects.

How can the cluster centers be determined automatically by finding the data objects with relatively large φ and large δ? The nonlinear function y = b_0 + b_1/x is used to fit the functional relationship δ_i^* = f(φ_i) between the field intensity φ_i and distance δ_i. The nonlinear function y = b_0 + b_1/x is first transformed into the linear function y = b_0 + b_1 x', where x' = 1/x, and linear regression analysis techniques are then adopted to fit the relationship δ_i^* = f(φ_i). Residual analysis is applied to find the outliers, which are the expected cluster centers.

After the cluster centers have been found, each remaining object is assigned to the same cluster as its nearest neighbor of higher field intensity. The cluster assignment is performed in a single scan, which is more efficient than partitional clustering algorithms, and ACC-FSFDP can also handle clusters of arbitrary shape. For example, suppose a number represents the level of field intensity of a data object; the bigger the number, the higher the field intensity. If data object "4" is a cluster center with cluster label 1, then the cluster label of data object "3" is that of its nearest neighbor of higher field intensity, i.e., the same as data object "4", which is 1.

For noise objects, we do not simply use a noise-signal cutoff. Instead, we first find the border region of each cluster, defined as the set of objects assigned to that cluster that lie within a distance Dis of data objects belonging to other clusters, where Dis is the average of the pairwise distances between cluster centers. We then find, for each cluster, the object with the highest field intensity within its border region. Denoting its field intensity by φ_b, we keep only the objects whose field intensity is greater than or equal to φ_b.

ACC-FSFDP only needs the impact factor σ, which controls the interaction between data objects, to be set in advance. A minimum potential entropy based optimization algorithm for the impact factor σ is introduced in Yan et al. [26]: the larger the potential entropy, the more unstable the data field, so the impact factor optimization problem is converted into a potential entropy minimization problem. This paper adopts that optimization algorithm to determine the value of σ and thereby resolves the parameter sensitivity issue.

3.3.3. Regression analysis to determine cluster centers

Regression analysis is a statistical method for determining interdependent relationships between two or more quantitative variables. Four requirements involving the distribution of ε must be satisfied:

(1) The probability distribution of ε is normal.
(2) The mean of ε is zero: E(ε) = 0.
(3) The standard deviation of ε is σ for all values of x.
(4) The errors associated with different values of y are all independent.

In other words, the linear regression model assumptions are consistent with the Gauss-Markov theorem.

Gauss-Markov theorem. In a linear regression model where the errors have expectation zero, equal variances and are uncorrelated, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator.

Inference 1. Each error ε_i = δ_i^* − δ_i of the linear model δ^* = b_0 + b_1 φ', where φ' = 1/φ, satisfies the N(0, σ^2) normal distribution.

Inference 2. Each standardized error ZRE_i = ε_i/σ satisfies the N(0, 1) standard normal distribution.

Theorem 1. For each error ε_i, there is a confidence interval [ε_i − σ Z_{α/2}, ε_i + σ Z_{α/2}] of confidence level 1 − α. If the error of a data object falls outside the confidence interval, the data object is a cluster center.

Proof. Let the error ε_i be N(0, σ^2) distributed. Then

P{ |X̄ − μ| / √(σ^2/n) ≤ Z_{α/2} } = 1 − α

P{ −Z_{α/2} ≤ (X̄ − μ) / √(σ^2/n) ≤ Z_{α/2} } = 1 − α

P{ −(σ/√n) Z_{α/2} ≤ X̄ − μ ≤ (σ/√n) Z_{α/2} } = 1 − α

P{ X̄ − (σ/√n) Z_{α/2} ≤ μ ≤ X̄ + (σ/√n) Z_{α/2} } = 1 − α

For each single error, X̄ = ε_i and n = 1. Finally, the confidence interval is

P{ ε_i − σ Z_{α/2} ≤ μ ≤ ε_i + σ Z_{α/2} } = 1 − α

If the error of a data object falls outside this confidence interval, the object is called an outlier, which is a cluster center.
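A minimal sketch of this center-detection step (ours, using NumPy least squares; estimating σ from the residual standard deviation and using Z_{0.025} ≈ 1.96 are assumptions consistent with α = 0.05):

```python
import numpy as np

def detect_centers(phi, delta, z_alpha2=1.96):
    """Fit delta = b0 + b1/phi by OLS on the transformed variable and flag residual outliers."""
    phi = np.asarray(phi, dtype=float)
    delta = np.asarray(delta, dtype=float)
    x = 1.0 / phi                                   # linearization: x' = 1/phi
    A = np.column_stack([np.ones_like(x), x])
    (b0, b1), *_ = np.linalg.lstsq(A, delta, rcond=None)
    residuals = delta - (b0 + b1 * x)
    sigma = residuals.std(ddof=2)                   # estimate of the error standard deviation
    centers = np.where(np.abs(residuals) > z_alpha2 * sigma)[0]
    return centers, residuals
```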

3.3.4. Case verification

We take the sample data introduced in Fig. 3 as an example. The nonlinear function y = b_0 + b_1/x is transformed into the linear function y = b_0 + b_1 x', where x' = 1/x, to fit the functional relationship δ_i^* = f(φ_i) between field intensity φ and distance δ, as shown in Fig. 4.

With α = 0.05, residual analysis is applied to the fitted curve. Most of the data objects in Fig. 4, shown as green objects, are distributed randomly around zero (within the range [−1.5, 1.5]), which illustrates that the nonlinear model y = b_0 + b_1/x is reasonable. The three red objects outside the confidence interval, defined as outliers, are the cluster centers, as expected.

Fig. 4. Fitting curve and residual distribution. (For interpretation of the references to color in the text, the reader is referred to the web version of this article.)

3.3.5. The process of ACC-FSFDP algorithm

Step 1: According to the dominance analysis, the data set is pre-processed and the corresponding distance measure metric is selected.
Step 2: Compute the distance δ and field intensity φ of each object in data set D.
Step 3: Transform the nonlinear function y = b_0 + b_1/x into the linear function y = b_0 + b_1 x', where x' = 1/x, and fit the functional relationship δ_i^* = f(φ_i) between field intensity φ and distance δ.
Step 4: Use residual analysis to obtain the number of clusters k and the cluster centers {C_1, C_2, ..., C_k} automatically.
Step 5: Assign each remaining object to the same cluster as its nearest neighbor of higher field intensity, on the basis of the cluster centers {C_1, C_2, ..., C_k}.
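The assignment in Step 5 can be sketched as follows (our illustration; it assumes the field intensities φ, a pairwise distance matrix, and the center indices from the previous steps are already available):

```python
import numpy as np

def assign_clusters(phi, dist_matrix, centers):
    """Step 5: each non-center object joins the cluster of its nearest neighbor of higher field intensity."""
    n = len(phi)
    labels = np.full(n, -1, dtype=int)
    for k, c in enumerate(centers):
        labels[c] = k
    # Process objects from highest to lowest field intensity so the chosen neighbor is already labeled.
    for i in sorted(range(n), key=lambda j: -phi[j]):
        if labels[i] != -1:
            continue
        higher = [j for j in range(n) if phi[j] > phi[i]]
        if not higher:                      # the global intensity maximum should itself be a center
            labels[i] = labels[min(centers, key=lambda c: dist_matrix[i][c])]
            continue
        nearest = min(higher, key=lambda j: dist_matrix[i][j])
        labels[i] = labels[nearest]
    return labels
```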

3.4. Basic definition of Str-FSFDP

Assume that the input data are a sequence of samples X_1, X_2, ..., X_i, ..., arriving at time stamps T_1, T_2, ..., T_i, .... Each sample X_i = {X_i^1, X_i^2, ..., X_i^d} is represented by d attributes, of which r are numeric and q are categorical, and is denoted by X_i = C_i : B_i = (x_i^1, x_i^2, ..., x_i^r, y_i^1, y_i^2, ..., y_i^q), where C_i consists of the numerical attributes, denoted by the vector x_i^1, x_i^2, ..., x_i^r, and B_i consists of the categorical attributes, denoted by the vector y_i^1, y_i^2, ..., y_i^q.

Definition 10. (Data density) For each data object x arriving at time t_c, its time stamp is defined as T(x) = t_c, and its density coefficient D(x, t) at time t is defined as

D(x, t) = 2^{−λ(t − T(x))} = 2^{−λ(t − t_c)}    (14)

where λ ∈ (0, 1) is a constant called the decay factor.

Definition 11. (Micro cluster density) For each micro cluster mc, at a given time t, let E(mc, t) be the set of data objects assigned to mc at or before time t. Its density D(mc, t) is defined as the sum of the density coefficients of all data objects mapped to mc:

D(mc, t) = ∑_{x ∈ E(mc, t)} D(x, t)    (15)

The density of any micro cluster changes constantly. However, it is unnecessary to update the density value at every time stamp; instead, the density value is updated only when a new data object is assigned to that micro cluster. The time stamp of the last arrived data object is recorded, so that the density of the micro cluster can be updated when a new data object arrives at the micro cluster.

Proposition 1. Suppose a micro cluster mc receives a new data object at time t_n, and suppose that the time at which it received the last data object is t_l (t_l < t_n). Then the density of mc can be updated by

D(mc, t_n) = 2^{−λ(t_n − t_l)} D(mc, t_l) + 1    (16)

Proof. Let X = {x_1, x_2, ..., x_m} be the set of data objects in mc at time t_l. We have

D(mc, t_l) = ∑_{i=1}^{m} D(x_i, t_l)

According to formula (14), we have

D(x_i, t_n) = 2^{−λ(t_n − t_l)} D(x_i, t_l)

Therefore,

D(mc, t_n) = ∑_{i=1}^{m} D(x_i, t_n) + 1 = ∑_{i=1}^{m} 2^{−λ(t_n − t_l)} D(x_i, t_l) + 1 = 2^{−λ(t_n − t_l)} D(mc, t_l) + 1

Proposition 2. Let X(t) be the set of all data objects that arrive from time 0 to t. The sum of the densities of all data objects never exceeds 1 / (1 − 2^{−λ}), and the average density of each micro cluster does not exceed 1 / (N(1 − 2^{−λ})), where N is the number of micro clusters.

Proof. For a given time t, ∑_{x ∈ X(t)} D(x, t) is the sum of the density coefficients of the t + 1 data objects that arrive at time stamps 0, 1, ..., t, respectively. For a data object x arriving at time t', 0 ≤ t' ≤ t, its density is D(x, t) = 2^{−λ(t − t')}. Therefore, the sum over all data objects is

∑_{x ∈ X(t)} D(x, t) = ∑_{t'=0}^{t} 2^{−λ(t − t')} = (1 − 2^{−λ(t+1)}) / (1 − 2^{−λ}) ≤ 1 / (1 − 2^{−λ})    (17)

Proposition 2 shows that the sum of the densities of all data objects does not exceed 1 / (1 − 2^{−λ}). With N micro clusters, the average density of each micro cluster therefore does not exceed 1 / (N(1 − 2^{−λ})).

Definition 12. (Dense micro cluster and sparse micro cluster) At time t, consider a micro cluster mc containing a set of data objects X_{i1}, ..., X_{in} arriving at times T_{i1}, ..., T_{in}, with density D(mc, t). If

D(mc, t) ≥ μ / (N(1 − 2^{−λ})) = D_thred    (18)

where μ > 1 is a controlling threshold, then the micro cluster is a dense micro cluster; otherwise it is a sparse micro cluster.

Definition 13. (Characteristic vector) A micro cluster mc is defined as a tuple (CF1, CF2, CF3, H(t), T_0, T_l, D, Status), where CF1 = ∑_{X_i ∈ mc} D(X_i, t) X_i is the weighted linear sum of the objects' numerical attributes, CF2 = ∑_{X_i ∈ mc} D(X_i, t) X_i^2 is the weighted squared sum of the objects' numerical attributes, and CF3 = ∑_{X_i ∈ mc} dist(X_i, C) is the sum of the categorical-part distances between the objects in the micro cluster and its center C. H(t) records the frequency histogram of the categorical attributes with time stamps, T_0 is the time at which the micro cluster mc was created, T_l is the last time the micro cluster was updated, D is the micro cluster density at the last update, and Status = {Dense, Sparse} is a label distinguishing dense and sparse micro clusters.

The center of the micro cluster consists of two parts: the numerical attributes are represented by the mean value C_n = CF1 / D(mc, t), and the categorical attributes are represented by the frequency histogram. The radius of the numerical part of the micro cluster is R_1 = CF2 / D(mc, t) − (CF1 / D(mc, t))^2, the radius of the categorical part is R_2 = CF3 / D(mc, t), and the radius of the micro cluster is R = R_1 + R_2.

The distance of the numerical part between a data object X_i and a micro cluster mc is

d(X_i, mc)_n = ∑_{p=1}^{r} (X_{i,n}^p − C_n^p)^2    (19)

where C is the center of the micro cluster.

For each categorical attribute of data object X_i and micro cluster mc, the simple matching approach is adopted, and the distance of the p-th attribute is

d(X_{i,c}^p, C_c^p) = 0 if X_{i,c}^p = C_c^p, and 1 if X_{i,c}^p ≠ C_c^p    (20)

The distance of the categorical part is then

d(X_i, mc)_c = ∑_{p=1}^{q} d(X_{i,c}^p, C_c^p)    (21)
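The characteristic vector and its density update (Eq. (16)) can be sketched as follows. This is our own minimal illustration, not the authors' implementation: whether CF1 and CF2 are also decayed on update, and how CF3 is refreshed against the current histogram center, are assumptions consistent with Definitions 10-13.

```python
from collections import Counter

class MicroCluster:
    """Sketch of the Str-FSFDP characteristic vector (CF1, CF2, CF3, H(t), T0, Tl, D, Status)."""
    def __init__(self, x_num, x_cat, t, lam):
        self.lam = lam
        self.cf1 = list(x_num)                      # weighted linear sum of numeric attributes
        self.cf2 = [v * v for v in x_num]           # weighted squared sum of numeric attributes
        self.cf3 = 0.0                              # sum of categorical distances to the center
        self.hist = [Counter([v]) for v in x_cat]   # H(t): per-attribute frequency histogram
        self.t0 = self.tl = t                       # creation time, last update time
        self.density = 1.0                          # D at the last update
        self.status = "Sparse"

    def decay(self, t):
        """Apply the decay factor for the time elapsed since the last update (assumed for CF1/CF2 too)."""
        f = 2 ** (-self.lam * (t - self.tl))
        self.density *= f
        self.cf1 = [v * f for v in self.cf1]
        self.cf2 = [v * f for v in self.cf2]

    def insert(self, x_num, x_cat, t):
        """Absorb a new object: Eq. (16) for the density, then update the summaries."""
        self.decay(t)
        self.density += 1.0
        self.cf1 = [a + b for a, b in zip(self.cf1, x_num)]
        self.cf2 = [a + b * b for a, b in zip(self.cf2, x_num)]
        for h, v in zip(self.hist, x_cat):
            h[v] += 1
        center_cat = [h.most_common(1)[0][0] for h in self.hist]
        self.cf3 += sum(1 for a, b in zip(x_cat, center_cat) if a != b)
        self.tl = t

    def center_num(self):
        return [v / self.density for v in self.cf1]  # C_n = CF1 / D(mc, t)
```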

3.5. Online maintenance for Str-FSFDP

When clustering an evolving data stream, new micro clusters appear constantly while old micro clusters decay and disappear. When a new data object arrives, we try to merge it into the nearest dense micro cluster or sparse micro cluster.

Algorithm 1 Process().

A new data object x_t arrives;
For (each dense micro cluster dmc)
    If (d(x_t, dmc) < ε) then Merge x_t into dmc;
Else For (each sparse micro cluster smc)
    If (d(x_t, smc) < ε) then
        Merge x_t into smc;
        If (D(smc, t) > D_thred) then
            Remove smc from the sparse micro clusters;
            The micro cluster smc becomes a dense micro cluster;
        End if
    Else
        Create a new sparse micro cluster and put data object x_t into it;
    End if
END

If the new data object cannot be absorbed by any existing micro cluster, a new sparse micro cluster is created containing it. The merging procedure is described in Algorithm 1.

The number of micro clusters will become extraordinarily large over time, which costs considerable memory space and increases the time needed to process each new data object. Therefore, a micro cluster deletion detection mechanism is necessary to maintain the micro clusters.

3.5.1. Detection interval TimeGap

The density of micro clusters keeps changing over time. A dense micro cluster may degenerate into a sparse micro cluster if it does not receive new data for a long time; on the other hand, a sparse micro cluster can be upgraded to a dense micro cluster when it receives enough new data objects. Therefore, after a period of time, the density of each micro cluster should be inspected. A key decision is the length of the time interval between micro cluster inspections. The interval can be neither too large nor too small: if it is too large, dynamic changes of the data stream will not be recognized adequately; if it is too small, it will result in frequent computation by the offline component and increase the workload, and when this computation load is too heavy, the processing speed of the offline component may not match the speed of the input data stream. In Str-FSFDP, we take the minimum time needed for a dense micro cluster to degenerate into a sparse micro cluster as the interval, to ensure that inspection is frequent enough to detect the density change of any micro cluster.

Proposition 3. For any dense micro cluster mc, the minimum time needed for mc to change from a dense micro cluster into a sparse micro cluster is

TimeGap = (1/λ) log_2 (D_thred / (D_thred − 1))    (22)

Proof. According to formula (18), if at time t a micro cluster mc is a dense micro cluster, then

D(mc, t) ≥ μ / (N(1 − 2^{−λ})) = D_thred    (23)

Suppose that after δt time mc becomes a sparse micro cluster, and that at time t + δt micro cluster mc receives a data object which makes its density become D_thred. We have

D(mc, t + δt) + 1 = D_thred
2^{−λ δt} D(mc, t) + 1 = D_thred
2^{λ δt} = D(mc, t) / (D_thred − 1)
δt = (1/λ) log_2 (D(mc, t) / (D_thred − 1))

Because TimeGap = min δt and D(mc, t) ≥ D_thred, the interval is TimeGap = (1/λ) log_2 (D_thred / (D_thred − 1)). Therefore, every TimeGap time units all the micro clusters are inspected. If no new data object has arrived in a dense micro cluster dmc and its density has decayed below the threshold, then dmc has already degenerated into an outlier; it should be deleted to free memory space for other new micro clusters.
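A one-line check of Eq. (22) together with the dense/sparse threshold of Eq. (18) (ours; the values of μ, λ and N are example inputs):

```python
import math

def d_thred(mu, lam, n_clusters):
    return mu / (n_clusters * (1 - 2 ** (-lam)))            # Eq. (18)

def time_gap(mu, lam, n_clusters):
    d = d_thred(mu, lam, n_clusters)
    return (1.0 / lam) * math.log2(d / (d - 1))             # Eq. (22)

print(time_gap(mu=2.0, lam=0.25, n_clusters=10))            # inspection interval for these example values
```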

Page 13: A fast density-based data stream clustering algorithm with ...ivsn-group.com/article/chenjinyin/A Fast Density... · meric and categorical attributes [11]. For example, the applicant

J.-Y. Chen, H.-H. He / Information Sciences 345 (2016) 271–293 283

3.5.2. Sparse micro cluster deletion strategy

With the continuous arrival of the data stream, the number of sparse micro clusters grows and becomes a burden on the system. Therefore, a deletion strategy is proposed in our algorithm which considers two cases. The first case is old sparse micro clusters, i.e., micro clusters that are overdue and no longer describe the current data. Much research has shown that data stream arrivals are approximately Poisson distributed, which means the inter-arrival time obeys a negative exponential distribution. Therefore, a homogeneous Poisson process is adopted to model the arrival times of samples in the same micro cluster. The arrival time difference between two adjacent samples obeys a negative exponential distribution with parameter λ. The mean value of y is used as an unbiased estimate of 1/λ:

1/λ̄ = ȳ = (∑_{i=1}^{N} y_i) / N = (T_l − T_0) / N    (24)

The deletion judgment for each sparse micro cluster is: if T − T_l > θ(T_l − T_0)/D(mc, T), then the micro cluster is overdue and should be deleted, where θ is a cutoff threshold set as θ = −ln(0.001) ≈ 7.

The second case is sparse micro clusters with small density values, which are probably outliers and should also be deleted. Although plenty of time is needed for an existing sparse micro cluster to grow into a dense micro cluster, real noise sparse micro clusters should be deleted in time, since keeping them costs a lot of memory. Therefore, a lowest-weight detection mechanism is proposed to remove real noise sparse micro clusters periodically. The lowest weight is defined as

ξ(T_c, T_0) = (2^{−λ(T_c − T_0 + gap)} − 1) / (2^{−λ T_p} − 1)    (25)

A sparse micro cluster whose density is below this lowest weight is unlikely to grow into a dense cluster and is probably caused by noise.

Algorithm 2 Update().
If (Tc mod gap == 0)
    For (each dense micro cluster dmc)
        If (D(dmc, Tc) < D_thred)
            The dense micro cluster dmc has degenerated into an outlier; remove this dense micro cluster;
    For (each sparse micro cluster smc)
        If (D(smc, Tc) < ξ(Tc, T0))
            The sparse micro cluster is unlikely to grow into a dense cluster; remove this sparse micro cluster;
        Else If (T − T_l > θ(T_l − T_0)/D(smc, T))
            The sparse micro cluster is overdue; remove this sparse micro cluster;
        End if
End
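For illustration only, the following Python sketch mirrors the structure of Algorithm 2. The helper names (a density(t) method and the attributes t0 and t_last on each micro cluster object) are placeholders of ours rather than the authors' implementation, the check period T_p in Eq. (25) is taken equal to gap, and gap is assumed to be an integer period.

    import math

    def lowest_weight(t_c, t_0, lam, gap):
        # Eq. (25), with the check period T_p taken equal to gap.
        return (2.0 ** (-lam * (t_c - t_0 + gap)) - 1.0) / (2.0 ** (-lam * gap) - 1.0)

    def update(dense_mcs, sparse_mcs, t_c, lam, gap, d_thred, theta=-math.log(0.001)):
        # dense_mcs / sparse_mcs: lists of micro cluster objects exposing density(t),
        # a creation time t0 and a last-arrival time t_last (placeholder attribute names).
        if t_c % gap != 0:
            return
        # A dense micro cluster whose density decayed below D_thred is now an outlier.
        dense_mcs[:] = [mc for mc in dense_mcs if mc.density(t_c) >= d_thred]
        kept = []
        for mc in sparse_mcs:
            if mc.density(t_c) < lowest_weight(t_c, mc.t0, lam, gap):
                continue        # too light to ever become dense: treated as noise
            if t_c - mc.t_last > theta * (mc.t_last - mc.t0) / mc.density(t_c):
                continue        # overdue: no recent arrivals
            kept.append(mc)
        sparse_mcs[:] = kept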

3.6. Offline clustering for Str-FSFDP

When a clustering request arrives, the improved density-based method is applied to the set of online-maintained micro clusters to obtain the final clustering result. Every micro cluster is regarded as a virtual object located at the center of the micro cluster.

Algorithm 3 OfflineCluster().
Input: micro clusters information
Output: final clustering result at any time
Do
    Get (mc); // Get an unprocessed dense micro cluster.
    If (mc is a dense micro cluster) then
        Find all the density-connected micro clusters from mc to form a cluster.
    Else
        Break;
    End if
Until all the micro clusters are processed.
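A minimal Python sketch of this offline step is given below for illustration, under assumed names: each micro cluster is treated as a virtual point, and density_connected(a, b) is a placeholder for the paper's density-connectivity test between two micro clusters, not a function defined in the paper.

    def offline_cluster(micro_clusters, density_connected):
        # micro_clusters: dense micro clusters, each treated as a virtual point at its center.
        # density_connected(a, b): placeholder predicate for the connectivity test.
        clusters, visited = [], set()
        for seed in micro_clusters:
            if id(seed) in visited:
                continue
            cluster, frontier = [], [seed]
            while frontier:                      # expand the cluster outward from the seed
                mc = frontier.pop()
                if id(mc) in visited:
                    continue
                visited.add(id(mc))
                cluster.append(mc)
                frontier.extend(o for o in micro_clusters
                                if id(o) not in visited and density_connected(mc, o))
            clusters.append(cluster)
        return clusters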

4. Simulations and analysis

4.1. Simulations and analysis for ACC-FSFDP

Twelve data sets from the widely used UCI Machine Learning Repository are used, as shown in Table 1. The Iris, Breast and KDD CUP'99 data sets are numerical-domain data, the Soybean, Zoo and Acute data sets are categorical-domain data, and the Statlog Heart and Credit data sets are balanced data.


Fig. 5. Clustering results on example data sets.

In all data sets, the class label attribute is not involved in the clustering process; it is used only to evaluate the clustering accuracy of the algorithms.

4.1.1. Performance evaluation

(1) In clustering analysis, the clustering accuracy ( r ) is one of the most commonly used criteria to evaluate the quality of

clustering results, defined as

$$r = \frac{\sum_{i=1}^{k} a_i}{n} \qquad (26)$$

where a_i is the number of data objects occurring both in the i-th cluster and in its corresponding true class, and n is the number of data objects in the data set. According to this measure, the larger r is, the better the clustering results are; for perfect clustering, r = 1.0.

(2) Another clustering quality measure is the average purity of clusters, defined as

$$Pur = \frac{\sum_{i=1}^{k} \left| C_i^{d} \right| / \left| C_i \right|}{k} \qquad (27)$$

where k denotes the number of clusters, |C_i^d| denotes the number of objects with the dominant class label in cluster i, and |C_i| denotes the number of objects in cluster i. Intuitively, the purity measures the purity of the clusters with respect to the true cluster (class) labels that are known for our data sets.
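The two measures can be computed directly from the predicted assignments and the withheld class labels. The short Python sketch below is illustrative only; it reads a_i (and |C_i^d|) as the count of the dominant true class inside cluster i, which is one common reading of the definition above.

    from collections import Counter

    def accuracy_and_purity(pred_clusters, true_labels):
        # pred_clusters / true_labels: parallel sequences, one entry per data object.
        n = len(true_labels)
        by_cluster = {}
        for c, y in zip(pred_clusters, true_labels):
            by_cluster.setdefault(c, []).append(y)
        # dominant[c] = number of objects of the most frequent true class in cluster c
        dominant = {c: Counter(ys).most_common(1)[0][1] for c, ys in by_cluster.items()}
        r = sum(dominant.values()) / n                                            # Eq. (26)
        pur = sum(dominant[c] / len(ys) for c, ys in by_cluster.items()) / len(by_cluster)  # Eq. (27)
        return r, pur

    # Toy example: clusters {0,0,1,1,1} against classes {a,a,b,b,a} gives r = 0.8, Pur ≈ 0.83.
    print(accuracy_and_purity([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))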

4.1.2. Result analysis

Unless particularly mentioned, ACC-FSFDP only needs the confidence factor α = 0.05 to be set a priori. There are four 2-dimensional data sets (Aggregation, Jain, Spiral and Flame) with various shapes of clusters (circular, elongated, spiral, etc.). The clustering results of ACC-FSFDP are presented in Fig. 5. The results show that the proposed algorithm achieves good clustering quality on data sets with arbitrary shapes and variable densities.

The linear function $\delta_i^* = b_0 + b_1 \varphi_i'$ and residuals analysis are used to determine the cluster centers. The field intensity and distance distribution is shown in Fig. 6, and Fig. 7 shows the results of the residuals analysis. The red circles mark the cluster centers of the corresponding data sets.

Clustering quality is evaluated by clustering purity and accuracy. The comparison results of ACC-FSFDP and other typical

algorithms are listed in Table 2.

We also run experiments on the KDD CUP'99 data sets. Taking 1000 objects as an interval, we select a number of representative sample data sets for testing, each containing 1000 objects. For example, in Table 3, when t = 150 there are 373 "normal" records, 380 "Satan" attacks, 5 "Bufoverflow" attacks, 99 "teardrop" attacks and 143 "Smurf" attacks; when t = 350 there are 381 "normal" records and 618 "Neptune" attacks. The clustering accuracy and purity of four sample data sets are listed to verify the clustering performance.

The following reasons contribute to the better performance of our proposed algorithm.

1. A novel mixed data distance measure metric based on dominance analysis is adopted. For numerical-dominant data and categorical-dominant data, the impact of non-dominant attributes on the overall similarity is reduced, while for balanced mixed data the importance of each attribute is considered. ACC-FSFDP benefits from this distance measure metric and obtains better clustering quality.

2. The field intensity and distance relationship of the data is analyzed in ACC-FSFDP; linear regression analysis is adopted to fit the functional relationship $\delta_i^* = f(\varphi_i)$, and the cluster centers are determined automatically by analyzing the residual distribution (a simplified sketch of this step is given after this list). The remaining data objects are then clustered by a one-time scan.
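The following Python sketch illustrates the center-selection idea of point 2 above (our simplification, not the authors' code): a straight line δ* = b0 + b1·φ is fitted by least squares, and the objects whose residuals exceed the upper confidence bound are flagged as cluster centers. The use of numpy and the bound 1.96 (for α = 0.05) are assumptions made for the illustration.

    import numpy as np

    def select_centers(phi, delta, z_bound=1.96):
        # phi: field intensity of each object; delta: its distance value (cf. Fig. 6).
        phi, delta = np.asarray(phi, float), np.asarray(delta, float)
        b1, b0 = np.polyfit(phi, delta, 1)       # least-squares fit: delta* = b0 + b1*phi
        residuals = delta - (b0 + b1 * phi)
        sigma = residuals.std(ddof=2)            # residual standard deviation
        # Objects far above the fitted line combine high field intensity with an
        # unusually large delta; they are returned as candidate cluster centers.
        return np.where(residuals > z_bound * sigma)[0]

The returned indices play the role of the red circles highlighted in Fig. 7.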


Fig. 6. Corresponding ϕ and δ distribution on eight data sets with optimal σ .


Fig. 7. Residuals distribution on eight data sets with optimal σ. (For interpretation of the references to color in the text, the reader is referred to the web version of this article.)


Table 2
Clustering quality comparison on real data sets of ACC-FSFDP and other typical algorithms (r / Pur).

Clustering algorithm   Iris           Breast         Soybean        Zoo            Acute          Heart          Credit
K-prototypes           0.819/0.842    0.84/0.862     0.856/0.877    0.806/0.83     0.610/0.72     0.577/0.644    0.562/0.624
SBAC                   0.426/0.46     0.45/0.53      0.617/0.642    0.426/0.485    0.508/0.556    0.752/0.778    0.555/0.627
KL-FCMGM               0.335/0.382    0.673/0.724    0.903/0.895    0.864/0.843    0.682/0.749    0.758/0.802    0.574/0.632
EKP                    0.432/0.45     0.53/0.61      0.831/0.78     0.629/0.694    0.508/0.586    0.545/0.589    0.682/0.749
WFKprototypes          0.871/0.82     0.761/0.78     0.81/0.82      0.908/0.876    0.710/0.765    0.835/0.826    0.838/0.826
IWKM                   0.822/0.84     0.882/0.893    0.908/0.922    0.51/0.56      0.62/0.54      0.55/0.62      0.779/0.806
DBSCAN                 0.695/0.72     0.655/0.71     0.56/0.61      0.76/0.891     0.901/0.87     0.543/0.602    0.678/0.782
BIRCH                  0.86/0.89      0.72/0.763     0.782/0.81     0.82/0.781     0.802/0.81     0.823/0.81     0.761/0.77
SpectralCAT            0.96/0.98      0.921/0.939    0.894/0.895    0.881/0.91     0.867/0.824    0.82/0.824     0.77/0.794
TGCA                   0.90/0.90      0.92/0.91      0.891/0.89     0.42/0.53      0.832/0.87     0.88/0.83      0.78/0.79
k-modes                0.84/0.831     0.84/0.863     0.915/0.9      0.878/0.867    0.78/0.767     0.783/0.84     0.87/0.723
Fk-modes               0.792/0.821    0.87/0.887     0.957/0.985    0.637/0.702    0.75/0.75      0.831/0.832    0.78/0.83
OCIL                   0.912/0.93     0.916/0.928    0.898/0.902    0.732/0.764    0.763/0.786    0.827/0.831    0.713/0.761
DC-MDACC               0.96/0.964     0.938/0.945    0.957/0.985    0.892/0.849    0.917/0.918    0.848/0.833    0.796/0.833
ACC-FSFDP              0.96/0.958     0.938/0.947    0.957/0.985    0.874/0.862    0.90/0.921     0.832/0.842    0.784/0.814
σ                      0.23           3.43           2.62           1.49           1.06           0.18           0.24

Table 3
Clustering quality on KDD CUP-99 sample data sets.

Number of each kind of attack    t = 150    t = 250    t = 350    t = 450
Normal                           373        -          381        215
Satan                            380        -          -          -
Bufoverflow                      5          -          -          -
Teardrop                         99         -          -          -
Smurf                            143        1000       -          785
Neptune                          -          -          618        -
Land                             -          -          1          -
Total objects                    1000       1000       1000       1000
Accuracy (r)                     0.977      1.0        0.972      0.997
Purity                           0.967      1.0        0.969      0.996

Fig. 8. Average execution time comparison of ACC-FSFDP and other typical algorithms on eight data sets.

4.1.3. Time performance comparison

Fig. 8 shows the average execution time of our proposed algorithm and other algorithms on eight data sets. Because the numbers of data objects in the Iris, Soybean, Zoo and Acute data sets are small, the execution is fast. The KDD-CUP sample data sets and the Breast data set have relatively more data objects, so the execution time is longer. Since the balanced data sets, such as Heart and Credit, adopt the probability-and-statistics method in the pre-processing stage, they need more time than the others.

4.1.4. Complexity analysis

Assume that the data set has n data objects. The time complexity of ACC-FSFDP mainly consists of the computation of the field intensity and of the distance for each data object, whose computational costs are O(n²) and O((n² − n)/2), respectively. After the cluster centers are found, the cluster assignment is performed by a one-time scan, with a corresponding computational cost of O(n − k), where k denotes the number of cluster centers.


Table 4
Details for synthetic and real-world data sets used.

Data set      Number of objects    Number of attributes    Number of clusters
RBF5          300,000              5                       5
RBF10         300,000              10                      5
RBF15         300,000              15                      5
DS1           10,000               2                       4
DS2           10,000               2                       4
DS3           10,000               2                       4
EDS           100,000              2                       4
KDD CUP 99    494,031              41                      24

Table 5
The purity of synthetic data sets.

Data set    Purity
RBF5        1.00
RBF10       1.00
RBF15       0.98
DS1         1.00
DS2         1.00
DS3         1.00

The time complexity of partition-based clustering algorithms is O(iter·k·n) and that of hierarchical clustering algorithms is O(n²), so the time complexity of ACC-FSFDP is higher than that of partition-based clustering algorithms and similar to that of hierarchical clustering algorithms. ACC-FSFDP can determine the cluster centers automatically and can deal with arbitrary-shape clusters without sensitive parameters.

4.2. Simulations and analysis for Str-FSFDP

On the basis of ACC-FSFDP as the initialization stage, Str-FSFDP is designed for mixed data stream clustering. Synthetic and real-world data sets are used to verify the performance of Str-FSFDP; they are listed in Table 4. The first data sets used for evaluation are random Radial Basis Function (RBF) synthetic data sets, which have been widely used for evaluating clustering algorithms on evolving data streams. We generate three synthetic data sets, RBF5, RBF10 and RBF15, with 5, 10 and 15 attributes respectively; these data sets consist of 5 clusters and 30 k objects. The evolving data stream EDS is composed of the synthetic data sets DS1, DS2 and DS3: we generate it by randomly choosing one of the synthetic data sets (DS1, DS2 and DS3) 10 times, and each synthetic data set contains 10,000 objects, depicted in Fig. 12(a)–(c), so the total length of the evolving data stream is 100,000. In order to test the scalability of Str-FSFDP, we also generate synthetic data sets with different numbers of dimensions and clusters using a random radial basis function generator. The following notation is adopted to characterize these synthetic data sets: "B" denotes the number of objects, "C" denotes the number of clusters and "D" denotes the number of dimensions. For example, B100KC10D25 means the data set contains 100,000 data objects with 25 dimensions belonging to 10 different clusters. The real data set is the KDD-CUP'99 Network Intrusion Detection data set; the experiment extracts a 10 percent sample for testing, which includes 494,031 objects.

Unless particularly mentioned, the parameters of the Str-FSFDP algorithm are set as follows: initial number of objects InitN = 1000, decay factor λ = 0.35, dense threshold μ = 4. v denotes the speed of the data stream and H denotes the horizon.

4.2.1. Clustering quality evaluation

First the RBF data sets are tested. Since the RBF sets are random Radial Basis Function synthetic data sets, they are purely numerical, so the Sum of Squared Distance (SSQ) is used to evaluate the accuracy of the DenStream algorithm and Str-FSFDP. SSQ is defined as $SSQ = \sum_{i} d^2(x_i, v_{x_i})$, where $v_{x_i}$ is the center of the cluster to which $x_i$ is assigned. Fig. 9 illustrates the cluster quality in terms of SSQ scores. It shows that Str-FSFDP always has lower SSQ values than DenStream and is also more stable.
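For reference, a minimal sketch of how this SSQ definition can be computed, under the assumption that each object is compared with the center of the cluster it is assigned to (numpy is used only for convenience):

    import numpy as np

    def ssq(points, assignments, centers):
        # Sum of squared Euclidean distances between each object and its assigned cluster center.
        points, centers = np.asarray(points, float), np.asarray(centers, float)
        return float(sum(np.sum((p - centers[c]) ** 2)
                         for p, c in zip(points, assignments)))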

Then we test the proposed algorithm on the DS1, DS2, DS3 and EDS data sets, selecting some representative times for testing. The results on the DS data sets are shown in Fig. 10. The purity of RBF5, RBF10, RBF15, DS1, DS2 and DS3 clustered by Str-FSFDP is listed in Table 5.

The clustering quality on EDS and KDD CUP'99 is evaluated by the average purity of clusters, listed in Table 6. Fig. 11(a) shows the clustering purity of Str-FSFDP and the other comparison algorithms on the EDS data set with different horizons and speeds: with horizon H = 1 and speed v = 1000, Str-FSFDP has clustering purity higher than 95%. The results of Str-FSFDP with horizon H = 10 and speed v = 100 are shown in Fig. 11(b), where Str-FSFDP again has the highest purity.


Fig. 9. Comparison of SSQ for RBF data sets by Str-FSFDP.

Fig. 10. Synthetic data sets clustering results by Str-FSFDP.

Table 6
Global clustering purity (%).

Data set    Horizon             Str-FSFDP    StrDenAP    StrAP    DenStream    HCluStream    CluStream
EDS         H = 1, v = 1000     96.48        91.49       87.67    84.60        84.42         81.24
EDS         H = 10, v = 100     97.32        92.33       88.12    84.98        87.06         84.26
KDD'99      H = 5, v = 1000     98.64        97.89       94.81    92.75        93.46         91.26
KDD'99      H = 10, v = 500     98.93        98.06       95.79    93.44        94.72         92.64

Fig. 11(c) and (d) show the clustering purity of Str-FSFDP and the other comparison algorithms on the Network Intrusion data set with different horizons and speeds. With horizon H = 1 and speed v = 1000, Str-FSFDP has clustering purity higher than 95%. We also test Str-FSFDP with horizon H = 5 and speed v = 1000; the results, shown in Fig. 11(c), are similar to those in Fig. 11(d), and the purity of Str-FSFDP is again the highest.

The following reasons contribute to the better performance of the Str-FSFDP algorithm.

1. ACC-FSFDP is adopted for initialization: regression analysis is used to fit the relationship between the density and the distance of every data object, and residual analysis is used to determine the centers automatically, so that Str-FSFDP obtains a good initial clustering result.


Fig. 11. Clustering purity on EDS data set and KDDCUP’99.

2. A new micro cluster characteristic vector is introduced, which can capture the characteristics of a mixed data object and its distribution accurately, improving the clustering purity of the proposed algorithm (a simplified sketch of such a summary structure is given after this list).

3. The online micro cluster maintenance mechanism of Str-FSFDP can delete the outliers in time and cluster the potential

micro clusters, which improves the clustering efficiency and clustering quality.
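As an illustration of point 2 above, the sketch below shows one possible form of such a mixed-attribute micro-cluster summary with 2^(−λΔt) decay. The exact fields (decayed weight, linear and squared sums for numeric attributes, decayed frequency counts for categorical attributes) are our assumptions for the illustration, not the paper's exact characteristic vector.

    from collections import Counter

    class MicroCluster:
        # Simplified mixed-attribute micro cluster summary with 2^(-lambda*dt) decay.
        def __init__(self, num_values, cat_values, t, lam):
            self.lam = lam
            self.t0 = self.t_last = t
            self.w = 1.0                                        # decayed weight (density)
            self.ls = list(num_values)                          # decayed linear sums (numeric part)
            self.ss = [v * v for v in num_values]               # decayed squared sums (numeric part)
            self.cat = [Counter({v: 1.0}) for v in cat_values]  # decayed category frequencies

        def _decay(self, t):
            f = 2.0 ** (-self.lam * (t - self.t_last))
            self.w *= f
            self.ls = [x * f for x in self.ls]
            self.ss = [x * f for x in self.ss]
            for c in self.cat:
                for k in c:
                    c[k] *= f

        def insert(self, num_values, cat_values, t):
            self._decay(t)
            self.w += 1.0
            self.ls = [a + b for a, b in zip(self.ls, num_values)]
            self.ss = [a + b * b for a, b in zip(self.ss, num_values)]
            for c, v in zip(self.cat, cat_values):
                c[v] += 1.0
            self.t_last = t

        def center(self):
            # Numeric part: decayed mean; categorical part: most frequent category.
            return ([x / self.w for x in self.ls],
                    [c.most_common(1)[0][0] for c in self.cat])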

4.2.2. Execution time analysis

Str-FSFDP is compared with StrDenAP and StrAP in terms of execution time. Fig. 12 shows that the execution time of the three algorithms increases with the data size, while Str-FSFDP costs much less time than StrDenAP and StrAP as the EDS data stream length grows. StrAP needs to rebuild its model when the current model changes and to store a lot of historical data in the process, while Str-FSFDP deletes expired or low-weight data in time to improve execution speed.

The execution time of Str-FSFDP is also evaluated on data streams with various dimensionalities and cluster numbers. The first series of data sets was generated by varying the cluster number C from 10 to 50 while fixing the dimensionality and using a 100 k data stream. Fig. 13(a) shows that the execution time stays almost constant as the number of cluster classes increases; in other words, the number of dense micro-clusters stays fixed as the number of clusters increases. Another series of data sets was generated by varying the dimensionality D from 10 to 40 while fixing the cluster number C and using a 100 k data stream. Fig. 13(b) shows that as the dimensionality increases, the execution time increases linearly.

4.2.3. Memory usage

Both the KDD99 and the EDS data streams are used to evaluate the memory usage of Str-FSFDP. Memory usage is measured by the number of micro-clusters. Fig. 14 shows that memory usage is bounded as the stream proceeds, because the micro-cluster maintenance mechanism adopted in the online stage of Str-FSFDP deletes overdue dense micro-clusters and potential micro-clusters of low density.


Fig. 12. The execution time comparison on the KDDCUP'99 and EDS data sets.

Fig. 13. Execution time with dimensions and cluster number.

Fig. 14. Memory usage evaluation of Str-FSFDP.


Fig. 15. Parameter sensitivity analysis of Str-FSFDP.

4.3. Sensitivity analysis

The decay factor λ and the dense threshold μ are the two parameters of Str-FSFDP. The decay factor λ controls the importance of historical data to the current clusters: the larger the value, the faster the data decay and the smaller the impact of historical data. Our experiments set λ = 0.35. Clustering quality is tested by varying λ from 0.0625 to 8, as shown in Fig. 15(a). When λ is either very small or very large, Str-FSFDP gets poor clustering quality; when λ ranges from 0.125 to 1, Str-FSFDP achieves clustering quality higher than 95%.

The dense threshold μ is key for distinguishing dense micro-clusters from sparse micro-clusters and controls the proportion of dense micro-clusters. Fig. 15(b) shows the clustering quality of Str-FSFDP when μ varies from 1 to 10. When μ ranges from 2 to 6, the clustering quality is good; in particular, with μ = 4 Str-FSFDP attains the highest purity.

5. Conclusion

A fast density-based data stream clustering algorithm called Str-FSFDP is proposed to cluster mixed data streams accurately. ACC-FSFDP is designed for initialization by determining the cluster centers automatically. A novel micro-cluster characteristic vector is introduced for mixed data and is updated dynamically. A micro-cluster decay function and a deletion mechanism are applied to maintain the micro clusters in the online stage, which makes the algorithm more consistent with the intrinsic characteristics of the original mixed data stream and insensitive to outliers. Future research will focus on parameter optimization of the mixed data stream clustering algorithm to achieve higher clustering quality.

Acknowledgments

The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported by a grant from the National Natural Science Foundation of China (grant no. 61502423) and the Zhejiang Provincial Natural Science Foundation (grant no. Y14F020092).

References

[1] R. Alex, L. Alessandro, Clustering by fast search and find of density peaks, Science 344 (2014) 1492–1496.
[2] C.C. Aggarwal, J.W. Han, J.Y. Wang, P.S. Yu, A framework for clustering evolving data streams, in: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB Endowment, 29, 2003, pp. 81–92.
[3] C.C. Aggarwal, J.W. Han, J.Y. Wang, P.S. Yu, A framework for projected clustering of high dimensional data streams, in: Proceedings of the 30th International Conference on Very Large Data Bases, VLDB Endowment, 30, 2004, pp. 852–863.
[4] A. Amir, D. Lipika, A k-mean clustering algorithm for mixed numeric and categorical data, Data Knowl. Eng. 63 (2) (2007) 503–527.
[5] S. Arkajyoti, D. Swagatam, Categorical fuzzy k-modes clustering with automated feature weight learning, Neurocomputing 166 (10) (2015) 422–435.
[6] V. Bhatnagar, S. Kaur, Clustering data streams using grid-based synopsis, Knowledge Inf. Syst. 41 (2014) 127–152.
[7] F. Cao, M. Ester, W. Qian, et al., Density-based clustering over an evolving data stream with noise, in: Proceedings of the SIAM Conference on Data Mining, 2006, pp. 326–337.
[8] L. Cen, B. Gautam, Unsupervised learning with mixed numeric and nominal data, IEEE Trans. Knowl. Data Eng. 14 (4) (2002) 673–690.
[9] J.Y. Chen, H.H. He, Research on density-based clustering algorithm for mixed data with determine cluster centers automatically, Acta Autom. Sin. 41 (10) (2015) 1798–1813.
[10] Y. Chen, L. Tu, Density-based clustering for real-time stream data, in: Proceedings of the Thirteenth International Conference on Knowledge Discovery and Data Mining, ACM, San Jose, 2007, pp. 133–142.
[11] C.H. Chung, C.C. Yu, Mining of mixed data with application to catalog marketing, Expert Syst. Appl. 32 (1) (2007) 12–23.
[12] C.H. Chung, L.C. Chen, W.S. Yu, Hierarchical clustering of mixed data based on distance hierarchy, Inf. Sci. 177 (20) (2007) 4474–4492.
[13] G. David, A. Averbuch, SpectralCAT: categorical spectral clustering of numerical and nominal data, Pattern Recognit. 45 (1) (2012) 416–433.
[14] H. Hong, H.T. Yong, A two-stage genetic algorithm for automatic clustering, Neurocomputing 81 (2) (2012) 49–59.
[15] Z.X. Huang, Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific Publishing, Singapore, 1997, pp. 21–34.
[16] Z.X. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in: Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, ACM Press, Arizona, 1998, pp. 1–8.
[17] Z.X. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst. 7 (4) (1999) 446–452.
[18] W.H. Jia, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001, pp. 179–182.
[19] C.J. Jin, B. Tian, G.Z. Chung, M. Chao, W. Zhe, An improved k-prototypes clustering algorithm for mixed numeric and categorical data, Neurocomputing 120 (1) (2013) 590–596.
[20] C.J. Jin, P. Wei, G.Z. Chung, H. Xiao, W. Zhe, A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowledge Based Syst. 30 (1) (2012) 129–135.
[21] S.Y. Miin, C.T. Yi, Bias-correction fuzzy clustering algorithms, Inf. Sci. 309 (10) (2015) 138–162.
[22] Z. Sobia, A.G. Mustansar, K. Asra, A.A. Muhammad, N. Usman, P.B. Adam, Novel centroid selection approaches for KMeans-clustering based recommender systems, Inf. Sci. 320 (1) (2015) 156–189.
[23] P.C. Sotirios, A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional, Expert Syst. Appl. 38 (7) (2011) 8684–8689.
[24] Z. Tian, R. Raghu, L. Miron, BIRCH: an efficient data clustering method for very large databases, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, ACM Press, Montreal, 1996, pp. 103–114.
[25] L. Tu, Y. Chen, Stream data clustering based on grid density and attraction, ACM Trans. Knowl. Discovery Data 3 (3) (2009) 1–27.
[26] G.W. Yan, Y. De, M. Jian, J.M. Wang, An hierarchical clustering method based on data fields, Acta Electron. Sin. 34 (2) (2006) 258–262.
[27] M.C. Yiu, J. Hong, Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number, Pattern Recognit. 46 (8) (2013) 2228–2238.
[28] J.P. Zhang, F.C. Chen, S.M. Li, L.X. Liu, Data stream clustering algorithm based on density and affinity propagation techniques, Acta Autom. Sin. 40 (2) (2014) 277–288.
[29] X. Zhang, C. Furtlehner, M. Sebag, Data streaming with affinity propagation, in: Proceedings of the 2008 Machine Learning and Knowledge Discovery in Databases, Springer, Berlin, Heidelberg, 2008, pp. 628–643.
[30] Z. Zhi, M.G. Mao, J.M. Jing, L.J. Chen, D.W. Qiao, Unsupervised evolutionary clustering algorithm for mixed type data, in: Proceedings of the 2010 IEEE Congress on Evolutionary Computation, CEC, Barcelona, 2010, pp. 1–8.