

European Journal of Operational Research 174 (2006) 930–944

www.elsevier.com/locate/ejor

Computing, Artificial Intelligence and Information Management

TASC: Two-attribute-set clustering through decision tree construction

Yen-Liang Chen *, Wu-Hsien Hsu, Yu-Hsuan Lee

Department of Information Management, National Central University, Chung-Li, 320 Taiwan, ROC

Received 30 October 2003; accepted 14 April 2005. Available online 27 June 2005.

Abstract

Clustering is the process of grouping a set of objects into classes of similar objects. In the past, clustering algorithms had a common problem: they use only one set of attributes for both partitioning the data space and measuring the similarity between objects. This problem has limited the use of the existing algorithms in some practical situations. Hence, this paper introduces a new clustering algorithm, which partitions the data space by constructing a decision tree using one attribute set, and measures the degree of similarity using another. Three different partitioning methods are presented. The algorithm is explained with an illustration. The performance and accuracy of the three partitioning methods are evaluated and compared.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Data mining; Clustering; Decision tree

1. Introduction

Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Due to its widespread usage in many applications, many clustering techniques have been developed (Ankerst et al., 1999; Basak and Krishnapuram, 2005; Bezdek, 1981; Chen et al., 2003; Friedman and Fisher, 1999; Grabmeier and Rudolph, 2002; Guha et al., 1998; Keim and Hinneburg, 1999; Jain et al., 1999; Kantardzic, 2002; Karypis et al., 1999; Klawonn and Kruse, 1997; Liu et al., 2000; Yao, 1998). In Han and Kamber (2001), the authors classify existing clustering techniques into five major categories:

0377-2217/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2005.04.029

* Corresponding author. Tel.: +86 886 3 4267266; fax: +86 886 3 4254604. E-mail address: [email protected] (Y.-L. Chen).


Fig. 1. An example of clustering through decision tree construction. (The root splits on X1: the branch X1 < 2 is a leaf labeled Y, giving cluster C1: X1 < 2; the branch X1 ≥ 2 is split on X2, giving a leaf labeled Y for X2 < 5, i.e., cluster C2: X1 ≥ 2 ∧ X2 < 5, and a leaf labeled N for X2 ≥ 5.)


(A) partitioning-based clustering algorithms, such as K-means, K-medoids, and CLARANS, (B) hierarchical-based clustering algorithms, which include the agglomerative approach (CURE and Chameleon) and the divisive approach (BIRCH), (C) density-based algorithms, such as DBSCAN and OPTICS, (D) grid-based algorithms, such as STING and CLIQUE, (E) model-based algorithms, such as COBWEB.

All the techniques mentioned above have a common limitation: only one set of attributes is considered for both partitioning the data space and measuring the similarity between objects. As a result, they cannot be applied to practical situations in which two sets of attributes are required to accomplish the job (the two-attribute-set problem).

Consider the following scenario: sales departments often need to cluster their customers so that different promotion strategies can be applied accordingly. Some promotion strategies are designed for certain groups of customers according to their consumption behaviors (the first attribute set), e.g., average expenditure, frequency of consumption, etc. Hence, customers should be clustered by the attribute set of consumption behaviors. On the other hand, in order to apply a promotion strategy to suitable customers, the department might need to know the characteristics of every single cluster in terms of the customers' personal information (the second attribute set), e.g., age, gender, income, occupation, and education. Furthermore, the department may want to use the patterns of the second attribute set to select potential customers before any consumption information is available, and apply the corresponding promotions to those customers. This explains the requirement of using two sets of attributes for the dataset-partitioning task and the similarity-measuring task.

We adopt the concept of decision tree construction to solve the two-attribute-set problem in clustering analysis. In Fig. 1, the first and the second nodes are labeled as Y, meaning that the number of objects of label Y is larger than that of N (a dense space); the third node is labeled as N, meaning that the number of objects of label N is larger than that of Y (a sparse space). Here the N objects are virtual records assumed to spread uniformly over the space; see Section 3.1.1. Thus, the clustering result produces two nodes with label Y. The feature of cluster 1 is X1 < 2, and that of cluster 2 is X1 ≥ 2 and X2 < 5.

In this paper, we present a new clustering algorithm, TASC, which allows different attribute sets for the dataset-partitioning task (tree construction) and the clustering task (similarity measuring), giving more flexibility to its applications.

This paper is divided into five sections. Section 2 gives the description and definitions of the problem in question. Section 3 introduces the clustering algorithm. Experimental results are presented in Section 4, and we conclude the research and give a future perspective on this issue in Section 5.

2. Problem statement and definitions

Given a dataset X and an attribute set $P = \{P_1, P_2, \ldots, P_r\}$, where $P_1, P_2, \ldots, P_r$ are all numerical attributes in X, the classifying attribute set A and the clustering attribute set C are two subsets of P, where A and C satisfy the following relationship:

$A \cap C = \{P_1, P_2, \ldots, P_s\}$, where $P_1, P_2, \ldots, P_s$ are arbitrary attributes of P.


Table 1
A sample dataset

ID    Age   Income   Average amount   Frequency
001   20    10       2                6
002   23    30       5                6
003   31    45       1                3
004   36    100      2                2
005   42    200      10               10
006   44    100      2                8
007   48    130      3                7


The goal of the algorithm is to construct a decision tree, with the attributes in A as non-leaf nodes, to segment the data space. Each leaf node represents a cluster in which all records are of satisfactory similarity as measured by the attributes in C.

In order to gain a better understanding, we use Table 1 to explain the following definitions.

Definition 1. A is a subset of P. $A = \{A_1, A_2, \ldots, A_m\}$ is a classifying attribute set if every $A_i$ in A is used for constructing the decision tree. Each $A_i$ is called a classifying attribute.

Ex: $A = \{A_1, A_2\}$, where $A_1$: age; $A_2$: income.

Definition 2. C is a subset of P. $C = \{C_1, C_2, \ldots, C_n\}$ is a clustering attribute set if every $C_i$ in C is to be used for measuring the similarity between records in S. Each $C_i$ is called a clustering attribute.

Ex: $C = \{C_1, C_2\}$, where $C_1$: average amount; $C_2$: frequency.

Definition 3. For each $A_i$ and $C_i$, the value of record x is denoted as $A_i(x)$ and $C_i(x)$, respectively.

Ex: $C_1(003) = 1$, $C_2(003) = 3$, and $A_1(003) = 31$.

Definition 4. If we partition an attribute $P_i$ into k intervals, the jth interval of $P_i$ is denoted as $P_i^j$. For each $A_i$, the jth interval of the attribute is denoted as $A_i^j$.

Ex: $A_1^1$: age < 30; $A_1^2$: 30 ≤ age < 40; $A_1^3$: age ≥ 40.

Definition 5. If we use $A_i$ to partition node S into k sub-nodes, then we have $S = \{s_i^1, s_i^2, \ldots, s_i^k\}$, where $s_i^j = \{x \mid x \in S, A_i(x) \in A_i^j\}$ for $1 \le j \le k$, and $s_i^y \cap s_i^z = \emptyset$ for $1 \le y \le k$, $1 \le z \le k$ and $y \ne z$.

Ex: From Table 1, let S = {001, 002, 003, 004, 005, 006, 007}. If we use $A_1$ to classify node S into three intervals with $A_1^1$: age < 30, $A_1^2$: 30 ≤ age < 40 and $A_1^3$: age ≥ 40, then we have $s_1^1 = \{001, 002\}$, $s_1^2 = \{003, 004\}$ and $s_1^3 = \{005, 006, 007\}$.
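To make Definition 5 concrete, the following minimal Python sketch partitions the Table 1 records by age into the three intervals above. The helper names (records, partition_node) are ours, not part of the paper.

```python
# A minimal sketch of Definition 5 on the Table 1 sample data.
records = {                      # ID: (age, income, average amount, frequency)
    "001": (20, 10, 2, 6),
    "002": (23, 30, 5, 6),
    "003": (31, 45, 1, 3),
    "004": (36, 100, 2, 2),
    "005": (42, 200, 10, 10),
    "006": (44, 100, 2, 8),
    "007": (48, 130, 3, 7),
}

def partition_node(node_ids, value, intervals):
    """Split a node into sub-nodes s_i^1..s_i^k according to which
    half-open interval [lo, hi) the classifying-attribute value falls in."""
    return [
        {rid for rid in node_ids if lo <= value(rid) < hi}
        for (lo, hi) in intervals
    ]

S = set(records)                                       # the root node
age = lambda rid: records[rid][0]                      # classifying attribute A1
intervals = [(0, 30), (30, 40), (40, float("inf"))]    # A1^1, A1^2, A1^3

s1, s2, s3 = partition_node(S, age, intervals)
print(sorted(s1), sorted(s2), sorted(s3))
# ['001', '002'] ['003', '004'] ['005', '006', '007'], as in the example above
```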

Definition 6. The space volume of node S.

(a) Let $\mathrm{Max}(C_i(S))$ denote the maximum value of $C_i$ over all the records in S, and similarly $\mathrm{Min}(C_i(S))$ the minimum value of $C_i$, where $1 \le i \le n$.

Ex: For $C_1$ and S = {001, 002, 003, 004, 005, 006, 007}, we have $\mathrm{Max}(C_1(S)) = 10$ and $\mathrm{Min}(C_1(S)) = 1$, where the maximum value occurs at record 005 and the minimum at record 003.


(b) $\mathrm{Diff}(C_i(S)) = \mathrm{Max}(C_i(S)) - \mathrm{Min}(C_i(S))$.

Ex: $\mathrm{Diff}(C_1(S)) = 10 - 1 = 9$; $\mathrm{Diff}(C_2(S)) = 10 - 2 = 8$.

(c) The space volume of node S, represented by P(S), is defined as $\prod_{i=1}^{n} \mathrm{Diff}(C_i(S))$.

Ex: $P(S) = 9 \times 8 = 72$.

Definition 7. The volume of a degenerate space.

Attribute $C_h$ is called degenerate in node $s_i^j$ if $\mathrm{Diff}(C_h(s_i^j)) = 0$. In this case, we define $P(s_i^j) = \prod_{y=1}^{h-1} \mathrm{Diff}(C_y(s_i^j)) \times \prod_{y=h+1}^{n} \mathrm{Diff}(C_y(s_i^j))$.

Ex: For $s_1^1 = \{001, 002\}$, we have $\mathrm{Max}(C_2(s_1^1)) = 6$ and $\mathrm{Min}(C_2(s_1^1)) = 6$. Thus, we have $P(s_1^1) = \mathrm{Diff}(C_1(s_1^1)) = 3$.

Definition 8. The density of a node.

The density of node S, D(S), is defined as $|S|/P(S)$.

Ex: $D(S) = 7/72$.

Definition 9. The extreme values of classifying attributes in a node.

Let $\mathrm{Max}(A_i(s_i^j))$ denote the maximum value of $A_i$ over all the records in $s_i^j$, and similarly $\mathrm{Min}(A_i(s_i^j))$ the minimum value of $A_i$, where $1 \le i \le m$.

Ex: For $A_1$ and $s_1^3 = \{005, 006, 007\}$, we have $\mathrm{Max}(A_1(s_1^3)) = 48$ and $\mathrm{Min}(A_1(s_1^3)) = 42$.

Definition 10. The expected number of records in a node.

Let $n_i^j$ denote the expected number of records in node $s_i^j$ if node $s_i^j$ has the same density as node S. Then, $n_i^j = D(S) \times P(s_i^j)$.

Ex: Since $D(S) = 7/72$ and $P(s_1^3) = 24$, we have $n_1^3 = 7/72 \times 24 = 7/3 \approx 2.33$.
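The quantities of Definitions 6–10 can be checked mechanically. Below is a minimal Python sketch over the Table 1 clustering attributes; the function names (diff, volume, density) are ours and simply restate the definitions.

```python
# A minimal sketch of Definitions 6-10 on the Table 1 data (illustrative names).
from math import prod

records = {                      # ID: (average amount, frequency) = (C1, C2)
    "001": (2, 6), "002": (5, 6), "003": (1, 3), "004": (2, 2),
    "005": (10, 10), "006": (2, 8), "007": (3, 7),
}

def diff(node_ids, i):
    """Diff(C_i(S)) = Max(C_i(S)) - Min(C_i(S)) (Definition 6(b))."""
    vals = [records[r][i] for r in node_ids]
    return max(vals) - min(vals)

def volume(node_ids):
    """P(S): product of Diff over the clustering attributes; degenerate
    attributes (Diff = 0) are skipped, as in Definition 7."""
    diffs = [diff(node_ids, i) for i in range(2)]
    return prod(d for d in diffs if d > 0)

def density(node_ids):
    """D(S) = |S| / P(S) (Definition 8)."""
    return len(node_ids) / volume(node_ids)

S = set(records)
s3 = {"005", "006", "007"}                  # the sub-node s_1^3 of Definition 5
print(volume(S), density(S))                # 72 and 7/72
print(volume(s3), density(S) * volume(s3))  # 24 and n_1^3 = 7/72 * 24 ~ 2.33
```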

Definition 11. Sparse nodes and dense nodes.

(a) Node S is sparse, and will be considered as noise and removed from the dataset, if:

(1) D(S) < a · D(R), where a is a user-defined factor for the threshold of the minimum density of a node and R is the parent node of S. That is, the density of node S is too low.

(2) |S| < c · |X|, where c is a user-defined factor for the threshold of the minimum number of records in a node. That is, there are too few records in node S.

(b) Node S is dense if D(S) is greater than b · D(R), where b is the factor for the threshold density of a dense node, and b > 1. A dense node becomes a leaf node without further processing.

Definition 12. Virtual nodes.

A virtual node $u_i^{x,y}$ ($1 \le x < k$, $1 < y \le k$, $x < y$) is not actually created, but is used to denote the node obtained by combining the real nodes $s_i^x, s_i^{x+1}, \ldots, s_i^{y-1}$ and $s_i^y$. For the virtual node $u_i^{x,y}$, we need to compute three of its properties, which are described as follows.


(a) The number of records in the node: $|u_i^{x,y}| = \sum_{z=x}^{y} |s_i^z|$.

Ex: $u_1^{2,3}$ is a virtual node obtained by combining $s_1^2$ and $s_1^3$. The example in Definition 5 sets $s_1^2 = \{003, 004\}$ and $s_1^3 = \{005, 006, 007\}$; so, we have $|s_1^2| = 2$, $|s_1^3| = 3$ and $|u_1^{2,3}| = 2 + 3 = 5$.

(b) The volume of the node: $P(u_i^{x,y}) = \prod_{q=1}^{n} (\phi_q - \pi_q)$, where $\phi_q = \max_w\{\mathrm{Max}(C_q(s_i^w))\}$ for $x \le w \le y$ and $\pi_q = \min_w\{\mathrm{Min}(C_q(s_i^w))\}$ for $x \le w \le y$.

Ex: Because $s_1^2 = \{003, 004\}$ and $s_1^3 = \{005, 006, 007\}$, we have $\mathrm{Max}(C_1(s_1^2)) = 2$, $\mathrm{Max}(C_1(s_1^3)) = 10$, $\mathrm{Min}(C_1(s_1^2)) = 1$, $\mathrm{Min}(C_1(s_1^3)) = 2$, $\mathrm{Max}(C_2(s_1^2)) = 3$, $\mathrm{Max}(C_2(s_1^3)) = 10$, $\mathrm{Min}(C_2(s_1^2)) = 2$ and $\mathrm{Min}(C_2(s_1^3)) = 7$. Accordingly, we obtain $\phi_1 = \max(2, 10) = 10$, $\pi_1 = \min(1, 2) = 1$, $\phi_2 = \max(3, 10) = 10$ and $\pi_2 = \min(2, 7) = 2$. Finally, $P(u_1^{2,3}) = (\phi_1 - \pi_1) \times (\phi_2 - \pi_2) = (10 - 1) \times (10 - 2) = 72$.

(c) The density of the node: $D(u_i^{x,y}) = |u_i^{x,y}| / P(u_i^{x,y})$.

3. The clustering algorithm

TASC is basically a decision-tree-construction algorithm. It employs a top-down, divide-and-conquer strategy to construct a decision tree.

Fig. 2 shows the pseudo code of TASC. Firstly, node S is examined. If S is already a dense node (Definition 11(b)), it is marked as a leaf node and the process ends. Otherwise, node S needs to be partitioned. Lines 6–8 tentatively use each classifying attribute to partition S and calculate the fitness of each attribute by measuring its diversity or entropy. Lines 10–14 then find the candidate with the highest fitness among those which have at least one sub-node with a higher density than that of S. Lines 15–16 check whether any candidate attribute is found, and mark S as a leaf node if none is found. Having determined the candidate classifying attribute, Lines 18–20 construct new sub-nodes for all $s_i^j$ in S which are not sparse (Definition 11(a)), and Build_Tree is called recursively for each new sub-node.
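Since Fig. 2 is not reproduced here, the following Python skeleton gives our reading of the Build_Tree recursion described above; it is a sketch, not the authors' pseudo code. The callables partition_by and fitness stand in for the chosen partitioning method (MEP, EWP or EDP) and its entropy or diversity measure, and the thresholds follow Definition 11 with the a, b, c parameters.

```python
# A sketch of the Build_Tree recursion (our reading of the prose around Fig. 2).
from math import prod

def volume(node, clustering_attrs):
    """P(S): product of Diff over the clustering attributes, skipping
    degenerate ones (Definitions 6 and 7). A node is a list of dict records."""
    diffs = [max(r[c] for r in node) - min(r[c] for r in node)
             for c in clustering_attrs]
    return prod(d for d in diffs if d > 0)

def density(node, clustering_attrs):
    return len(node) / volume(node, clustering_attrs)    # D(S) = |S| / P(S)

def build_tree(node, parent, n_total, clustering_attrs, partition_by, fitness,
               a=0.001, b=40.0, c=0.04):
    """Recursive sketch; the root node may be passed as its own parent."""
    d_node = density(node, clustering_attrs)
    d_parent = density(parent, clustering_attrs)
    # Definition 11(a): a sparse node is treated as noise and yields no cluster.
    if d_node < a * d_parent or len(node) < c * n_total:
        return None
    # Definition 11(b): a dense node becomes a leaf cluster without further cuts.
    if d_node > b * d_parent:
        return {"leaf": True, "cluster": node}
    # Tentatively split on every classifying attribute; keep candidates that
    # raise the density of at least one sub-node, then take the fittest one.
    candidates = []
    for attr, sub_nodes in partition_by(node):
        if any(density(s, clustering_attrs) > d_node for s in sub_nodes if s):
            candidates.append((fitness(node, sub_nodes, clustering_attrs),
                               attr, sub_nodes))
    if not candidates:
        return {"leaf": True, "cluster": node}           # no useful split
    _, attr, sub_nodes = max(candidates, key=lambda t: t[0])
    children = [build_tree(s, node, n_total, clustering_attrs, partition_by,
                           fitness, a, b, c)
                for s in sub_nodes if s]
    return {"leaf": False, "attribute": attr,
            "children": [ch for ch in children if ch is not None]}
```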

This algorithm adopts two measures, Entropy and Diversity, for calculating the fitness degree used to find the most suitable partitioning attribute. The two measures are discussed in Section 3.1. Three partitioning methods, minimum entropy partitioning (MEP), equal-width binary partitioning (EWP) and equal-depth binary partitioning (EDP), are explained in Sections 3.2–3.4, respectively. The entropy measure is applied to MEP, while the diversity measure is used with the latter two methods, EWP and EDP.

3.1. Two measures of fitness

3.1.1. Entropy measure

Suppose that the records in node S are marked with label "Y", and suppose that there are |S| virtual records with label "N" spreading uniformly across the space region of S. If we choose a certain cutting point on the classifying attribute $A_i$, the original space is partitioned into two sub-spaces, and the two sub-spaces might both contain some records marked with Y and some marked with N.

The number of records with label Y in the sub-space of volume $P(s_i^j)$ would be $|s_i^j|$. As for the virtual records of label N, since they spread uniformly across the space of S, the number of records with label N in that sub-space is $D(S) \times P(s_i^j)$. With these values, we calculate the entropy value for a given splitting point on a certain attribute. Finally, the cutting point with the lowest entropy value among all attributes is selected to partition the current node. Since entropy theory is widely known, we skip its details; readers may refer to Quinlan (1993, 1996) and Ruggieri (2002).
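As a sketch of how the virtual N records enter the computation, the following Python fragment scores a candidate split of S. The weighting of the sub-space entropies by their share of the (real plus virtual) records is one common C4.5-style choice and is our assumption, since the paper leaves the details to the cited references; the helper names are ours.

```python
# A minimal sketch of the entropy measure with virtual N records.
from math import log2

def entropy(y, n):
    """Two-class entropy of a region holding y real Y-records and n virtual N-records."""
    total = y + n
    if total == 0:
        return 0.0
    h = 0.0
    for count in (y, n):
        if count > 0:
            p = count / total
            h -= p * log2(p)
    return h

def split_entropy(sub_sizes, sub_volumes, density_S):
    """Score a candidate split of node S: sub-space j holds |s_i^j| Y-records
    and D(S) * P(s_i^j) virtual N-records; sub-space entropies are weighted
    by their share of all (real + virtual) records."""
    y_counts = list(sub_sizes)
    n_counts = [density_S * v for v in sub_volumes]
    grand = sum(y_counts) + sum(n_counts)
    return sum((y + n) / grand * entropy(y, n)
               for y, n in zip(y_counts, n_counts))

# Example: the Table 1 node S (|S| = 7, P(S) = 72, D(S) = 7/72) cut on age at 40
# gives sub-nodes of sizes 4 and 3 with volumes 16 and 24 (Definition 6).
print(split_entropy([4, 3], [16.0, 24.0], 7 / 72))
```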


Fig. 2. The pseudo code of the algorithm.


3.1.2. Diversity measure

When a node S is cut into several sub-nodes, the densities of some sub-nodes will be higher than that of node S, while some others will be lower. The strategy is to find the dense regions, which are likely to become clusters, and to filter out the sparse regions from consideration.

The density of node S is denoted as D(S). Having been partitioned by $A_i$, the jth sub-node of S has the following quantities: $n_i^j = D(S) \times P(s_i^j)$ is the expected number of records and $|s_i^j|$ is the actual number of records. According to the above strategy, the larger the value of $\bigl| |s_i^j| - n_i^j \bigr|$ becomes, the better the result gets. Furthermore, since each branch has a different number of records, we multiply the deviation $\bigl| |s_i^j| - n_i^j \bigr|$ by the weight $w_i^j = |s_i^j|/|S|$. The sum of the weighted deviations over all sub-nodes is defined as the diversity. Therefore, we have the diversity $\sum_{j=1}^{k} \bigl| |s_i^j| - n_i^j \bigr| \cdot |s_i^j|/|S|$.
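The diversity of a candidate split can then be computed directly from the sub-node sizes and volumes, as in the following sketch (helper names are ours):

```python
# A minimal sketch of the diversity measure defined above.

def diversity(sub_sizes, sub_volumes, density_S):
    """sum_j | |s_i^j| - n_i^j | * |s_i^j| / |S|, where n_i^j = D(S) * P(s_i^j)."""
    size_S = sum(sub_sizes)
    return sum(abs(s - density_S * v) * s / size_S
               for s, v in zip(sub_sizes, sub_volumes))

# The age >= 40 / age < 40 cut of the Table 1 node S (|S| = 7, D(S) = 7/72):
# sub-node sizes 4 and 3, volumes 16 and 24 (Definition 6), so the expected
# counts are 7/72*16 ~ 1.56 and 7/72*24 ~ 2.33.
print(diversity([4, 3], [16.0, 24.0], 7 / 72))   # ~ 1.68
```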

3.2. Minimum entropy partitioning (MEP)

Since all given attributes are numerical, the best partitioning point would be chosen from the mid-points of all pairs of adjacent values; the mid-point with the smallest entropy value becomes the real cutting point of the classifying attribute (Quinlan, 1996). In this study, for the sake of efficiency, we employ the grid-based method introduced in Cheng et al. (1999) and Agrawal et al. (1999).

The classifying attribute is partitioned into a fixed number of intervals of the same length. Then the entropy value of each cutting point is calculated. Lastly, we pick the point with the smallest entropy value to be the real cutting point of the classifying attribute. In this method, the larger the value range of the classifying attribute is, the more cutting points have to be calculated.

Thus, this method produces two sub-nodes: the records in the first sub-node have values smaller than the cutting point, while the second sub-node includes the records of S whose values are no less than the cutting point.

3.3. Equal-width binary partitioning (EWP)

EWP consists of two stages: a partitioning stage and a merging stage. In the first stage, we repeatedly partition each interval into two equal-length, non-overlapping intervals until no further partition is possible. The second stage then repeatedly merges a pair of adjacent intervals if the newly merged interval gives a better result. The task stops when no further improvement is possible. Details of the method are explained in the following paragraphs.

3.3.1. Stage 1: Partitioning stage

Suppose we use the classifying attribute $A_i$ to partition node S into two sub-nodes, $s_i^1$ and $s_i^2$. If the density of any sub-node is increased by more than d%, then there are two possibilities: (1) if the sub-node is a dense node, then we stop expanding the sub-node, for it becomes a leaf node; (2) otherwise, we further partition the interval associated with the sub-node. Here, we further partition the sub-node in an attempt to find dense nodes among its descendants.

As shown in Fig. 3, the classifying attribute $A_i$ is first employed to cut the node S into two sub-nodes $s_i^1$ and $s_i^2$. If the increase in the density of $s_i^1$ is higher than d but the leaf-node condition has not been reached, $s_i^1$ becomes a new node and Binary_Cut is called recursively to cut $s_i^1$. Likewise, the same procedure is applied to $s_i^2$. The process is recursively executed until the leaf-node condition is reached.

Fig. 3. Partitioning stage of EWP method.
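The following Python sketch gives our reading of the Binary_Cut recursion described above; density, is_dense and the d% test are assumed helpers rather than the authors' code.

```python
# A sketch of the Binary_Cut recursion of the partitioning stage.
def binary_cut(node, lo, hi, value, density, is_dense, d_percent,
               intervals, cut_points):
    """Recursively cut the interval [lo, hi) of the classifying attribute at its
    midpoint (equal width). A sub-node is cut further only when its density grew
    by more than d% and it is not yet a dense (leaf) node; otherwise the
    sub-interval is kept for the merging stage."""
    mid = (lo + hi) / 2.0
    cut_points.append(mid)
    for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
        sub = [r for r in node if sub_lo <= value(r) < sub_hi]
        if not sub:
            continue
        gain = (density(sub) - density(node)) / density(node) * 100.0
        if gain <= d_percent or is_dense(sub, node):
            intervals.append((sub_lo, sub_hi, sub))      # stop expanding this side
        else:
            binary_cut(sub, sub_lo, sub_hi, value, density, is_dense,
                       d_percent, intervals, cut_points)
```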

3.3.2. Stage 2: Merging stage

The second stage improves the outcome of the first stage by merging pairs of adjacent intervals whenever the merged pair has a higher density. The process of merging is as follows:

(A) For every two adjacent nodes, calculate the density of their combination.
(B) Find those combinations whose densities are higher than those of both of their constituent nodes.
(C) Select the combination with the highest density, and generate the node Y by combining its constituent nodes.



(D) For each node X adjacent to Y, calculate the density of the combination of X and Y. Note that there are at most two nodes adjacent to Y.

(E) Repeat the loop from (B) to (D) until no possible combination is found.

Let us use Fig. 4 as an example. $A_i$ is used to cut the interval into seven sub-intervals as well as seven sub-nodes. Suppose we, according to Definition 12, generate six virtual nodes, from $u_i^{2,3}$ to $u_i^{6,7}$, where $u_i^{a,b}$ is the virtual node obtained by combining the intervals from the ath interval to the bth interval. Assume that the density of the virtual node $u_i^{2,3}$ is higher than that of either $s_i^2$ or $s_i^3$, and that the density of $u_i^{4,5}$ is higher than that of either $s_i^4$ or $s_i^5$. Then, we have two combinations that can be considered for merging. In Fig. 4, these two combinations are marked with "H".

Further assume that the density of $u_i^{2,3}$ is higher than that of $u_i^{4,5}$. So, we will actually merge the second and the third intervals, and that means $u_i^{2,3}$ now becomes an actual node. Since $u_i^{2,3}$ has two adjacent intervals, the first interval and the fourth interval, we need to re-compute their combinations. The combination with the first interval produces the virtual node $u_i^{1,3}$, and the combination with the fourth interval produces $u_i^{2,4}$. In Fig. 5, we show the situation after the actual combination, and we use "J" to indicate that these two intervals have been actually merged.

Fig. 4. $u_i^{2,3}$ and $u_i^{4,5}$ can be considered for merging.

Fig. 5. After merging $u_i^{2,3}$.
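The merging loop of steps (A)–(E) can be sketched as follows, with a virtual node represented by the union of records and the attribute-wise bounding box of Definition 12; the representation and helper names are ours.

```python
# A sketch of the merging stage: adjacent intervals are merged greedily while
# a combination is denser than both of its constituents.
from math import prod

def combine(a, b):
    """Virtual node u from two adjacent nodes: union of the records and
    attribute-wise bounding box (phi_q, pi_q of Definition 12(b)).
    A node is a pair (records, bounds), bounds = [(min, max), ...] per attribute."""
    recs = a[0] + b[0]
    bounds = [(min(lo1, lo2), max(hi1, hi2))
              for (lo1, hi1), (lo2, hi2) in zip(a[1], b[1])]
    return recs, bounds

def density(node):
    recs, bounds = node
    vol = prod(hi - lo for lo, hi in bounds if hi > lo)   # skip degenerate attrs
    return len(recs) / vol

def merge_stage(nodes):
    """nodes: list of adjacent interval nodes, in attribute order."""
    while True:
        best = None
        # Steps (A)-(B): score every adjacent pair; keep improving combinations.
        for j in range(len(nodes) - 1):
            u = combine(nodes[j], nodes[j + 1])
            if density(u) > max(density(nodes[j]), density(nodes[j + 1])):
                if best is None or density(u) > density(best[1]):
                    best = (j, u)
        if best is None:                 # step (E): no combination improves
            return nodes
        j, u = best                      # steps (C)-(D): merge, then re-check
        nodes[j:j + 2] = [u]
```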


3.4. Equal-depth binary partitioning (EDP)

The process of EDP is similar to that of EWP. They use the same measure and the same partition method; the only difference lies in how they partition an interval into two sub-intervals. EWP partitions an interval into two sub-intervals of the same length, while EDP partitions it into two sub-intervals with the same number of data records. For example, suppose node S has eight records whose values on attribute $A_i$ are 1, 2, 4, 4, 5, 6, 10, and 12. Then EWP takes the average of the lowest and highest values as the partition point (in this example, 6.5), while EDP takes the median value as the partition point (in this example, 4.5). Since these two methods are similar to each other, we omit the details.
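As a small check of the two split-point rules, the following sketch reproduces the numbers in the example above (statistics.median is the standard-library median):

```python
# Equal-width vs. equal-depth split point for the eight-record example.
import statistics

values = [1, 2, 4, 4, 5, 6, 10, 12]

ewp_cut = (min(values) + max(values)) / 2     # equal-width: midpoint of the range
edp_cut = statistics.median(values)           # equal-depth: median of the records

print(ewp_cut, edp_cut)                       # 6.5 and 4.5, as in the text
```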

4. Experiments

Though the goal of a clustering process is to maximize the similarity within a cluster and the dissimilarity between clusters, the expected and obtained results often differ from one another, due to the attributes selected. In this study, we use two attribute sets for partitioning the dataset and calculating the similarity. Thus, the goal is not only to maintain good similarity within a cluster but also to maximize the accuracy obtained by the decision tree.

Three programs, MEP, EWP, and EDP, are tested on a Celeron 1.7 GHz Windows 2000 system with 768 MB of main memory and the JVM (J2RE 1.3.1-b24) as the Java execution environment.

The evaluation of efficiency and accuracy is done with synthetic data sets and reported in Sections 4.1 and 4.2 (please read Appendix A for details of the generation of the synthetic data). A decision tree created from a real data set is presented in Section 4.3. The discussion of the experiments is in Section 4.4.

4.1. Efficiency evaluation

In this experiment, we set m = 8, n = 8, mc = 6, nc = 6, k = 8, c = 5 but leave |X| and dis as variables, in order to compare the runtime of the three methods for different numbers of records. In both the normal distribution and uniform distribution cases, as the number of data records increases, the runtimes of the EDP and EWP methods increase only slightly, but that of MEP increases significantly. The results are shown in Figs. 6 and 7.

Fig. 6. Normal distribution: run time vs. number of records (y-axis: time in seconds; series: EDP, EWP, MEP).


Fig. 7. Uniform distribution: run time vs. number of records (y-axis: time in seconds; series: EDP, EWP, MEP).


4.2. Accuracy evaluation

In this section, we compare the accuracy of the three methods on different datasets. To measure their performance, we define the calculation of accuracy in the next paragraph; this accuracy is used for evaluating the three methods.

The accuracy of this experiment is calculated as follows: after the clusters are formed, we calculate the distance between each data object and each cluster. A data object is considered correctly clustered if the distance to its assigned cluster is the minimum over all clusters. The accuracy rate is the ratio of the number of correctly clustered data objects to the total number of data objects.
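As a minimal sketch of this measure, the following Python fragment assumes that the distance from an object to a cluster is the Euclidean distance to the cluster's centroid in the clustering-attribute space (the paper does not spell out the distance); an object counts as correct when its own cluster is the nearest one.

```python
# A minimal sketch of the accuracy rate (centroid distance is our assumption).
from math import dist

def centroid(cluster):
    return [sum(col) / len(col) for col in zip(*cluster)]

def accuracy(clusters):
    """clusters: list of lists of points (tuples of clustering-attribute values)."""
    centroids = [centroid(c) for c in clusters]
    correct = total = 0
    for idx, cluster in enumerate(clusters):
        for point in cluster:
            nearest = min(range(len(centroids)),
                          key=lambda j: dist(point, centroids[j]))
            correct += (nearest == idx)
            total += 1
    return correct / total

# Toy example: the point (4, 4) assigned to the second cluster is actually
# closer to the first cluster's centroid, so 4 of 5 objects are correct.
print(accuracy([[(1, 1), (2, 2)], [(8, 8), (9, 9), (4, 4)]]))   # 0.8
```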

From the experiments, we found that the accuracy is independent of the size of the datasets. For normal distribution datasets, as seen in Fig. 8, MEP achieves the best accuracy of around 0.8, while the other two are similar to each other, varying from 0.45 to 0.65. Fig. 9 shows the result for the uniform distribution case. In that experiment, all three methods maintain acceptable accuracy as the size of the dataset grows.

On the other hand, from the nature of the two-attribute-set problem, the dependency between the two attribute sets is also an important factor that affects the obtained accuracy. The dependency varies when different datasets are used or different attribute sets are selected. A hint for attribute selection is given in Section 4.4.2.

Fig. 8. Normal distribution: accuracy vs. number of records.

Fig. 9. Uniform distribution: accuracy vs. number of records.

4.3. The TASC tree built from a real dataset

In order to give a lucid example, we use the hitter file obtained from http://lib.stat.cmu.edu/databsets/baseball.data to build a TASC tree. The hitter file consists of data on the regular and leading substitute hitters in 1986 and 1987, and contains 24 attributes and 322 data objects. Among the 24 attributes, 7 non-numerical attributes, namely the hitter's name, league, division, team, position, etc., are removed from our experimental data due to the limitation of our algorithm. Also, 61 data objects with missing values are deleted from the data file. The final dataset contains 17 attributes and 261 data objects. In the experiment, we take 6 attributes for classification and all 17 attributes as clustering attributes. The description of the attributes is shown in Table 2.

We use EWP to construct this TASC tree. The four parameters are set as a = 0.001, b = 40, c = 0.04, and d = 0.3. In the result, five clusters are found with an accuracy of 0.69. The TASC tree is shown in Fig. 10.

4.4. Discussion on experiments

4.4.1. Comparison of the three methods

(A) EDP and EWP perform much faster than MEP.
(B) For normal distribution datasets, MEP has the highest accuracy rate, followed by EDP and EWP. For uniform distribution datasets, the three methods perform roughly the same in accuracy. There is no significant difference in accuracy between normal distribution and uniform distribution datasets.
(C) From the experiments, we find that EDP is the most suitable method of the three, since it takes less runtime yet maintains good accuracy. If accuracy is the main concern, then MEP is the choice.

Table 2
Dataset attribute description

Attribute description                        Classification attribute   Clustering attribute
Number of times at bat in 1986               A1                         C1
Number of hits in 1986                       A2                         C2
Number of home runs in 1986                  A3                         C3
Number of runs in 1986                       A4                         C4
Number of runs batted in 1986                A5                         C5
Number of walks in 1986                      A6                         C6
Number of years in the major leagues         –                          C7
Number of times at bat during his career     –                          C8
Number of hits during his career             –                          C9
Number of home runs during his career        –                          C10
Number of runs during his career             –                          C11
Number of runs batted during his career      –                          C12
Number of walks during his career            –                          C13
Number of put outs in 1986                   –                          C14
Number of assists in 1986                    –                          C15
Number of errors in 1986                     –                          C16
1987 Annual salary on opening day in USD     –                          C17

Fig. 10. A TASC tree built from a baseball player dataset. (The root splits on A1 into the ranges [19, 351] and [354, 687]; deeper splits use A2–A5; the five leaf clusters have sizes 11, 18, 13, 43 and 157.)

4.4.2. Attribute selection

There are two attribute sets that need to be defined: the classifying attribute set and the clustering attribute set. The clustering attribute set reflects the user's interest in the dataset. Therefore, users may choose the attributes according to the subject of the clustering task. For the classifying attribute set, users should choose attributes which are relevant to those of the clustering attribute set; irrelevant attributes will not produce good results. Alternatively, since the algorithm selects the best-fitting attribute automatically, users may put all attributes in the classifying attribute set if runtime is not a concern.

4.4.3. Sensitivity of user-defined parameters

In the experiments, we set the parameters as follows:

0.001 ≤ a < 0.1, increased by a factor of 10;
10 ≤ b ≤ 1000, increased by 10;
0.01 ≤ c ≤ 0.1, increased by 0.01;
0.1 ≤ d ≤ 1, increased by 0.1, and 0.01 ≤ d ≤ 0.1, increased by 0.01.

The sensitivity of the four user-defined parameters for each method is summarized as follows:

EWP (four parameters: a, b, c, and d)

• When d = 0.3, we find the best accuracy; when d > 0.5, the result is incorrect.
• The accuracy reaches its highest value when c = 0.06 or 0.1, and b = 10.
• b and c are independent of each other. When b and c are set to proper values, a has no effect on the result.


EDP (four parameters: a, b, c, and d)

• We find that the accuracy is not sensitive to the values of a and d, i.e., the effect of a and d is not significant.
• There is no particular value of c and b that gives the best accuracy. The accuracy varies slightly when b changes from 10, 100, to 1000, yet the differences are always below 0.05. On the other hand, when c < 0.04, the accuracy remains below 0.5; therefore, the suggested value of c is no less than 0.04.

MEP (three parameters: a, b, and c)

• We find that the accuracy is not sensitive to the value of a, i.e., the effect of a is not significant.
• When c > 0.09, the accuracy drops below 0.6, but when b is properly set, the accuracy may rise above 0.6. Therefore, we suggest first adjusting the value of c until the accuracy reaches its best, and then tuning the value of b to obtain a better result.

5. Conclusion

Most of the existing clustering algorithms consider only one attribute set and cannot be applied to the problem when two attribute sets are concerned. Our work relaxes this constraint so that the classifying attributes and the clustering attributes can be the same, partly different, or totally different. Two attribute sets are considered simultaneously in the process of clustering.

In this paper, we first define the classifying attributes and the clustering attributes, and then we define the characteristics of nodes and sub-nodes, including the number of records in a node, the space volume of a node, and the density of a node. Lastly, we define dense nodes and sparse nodes. Having given these definitions, we propose an algorithm with three variants. All three variants are capable of clustering data into clusters with two different attribute sets. To evaluate this algorithm, we measure the efficiency and accuracy of the three variants. Also, we demonstrate the capability of the algorithm by applying it to a real data set.

The following are four possible extensions in the future:

1. Due to the "two-attribute-set" design, the accuracy obtained from the algorithm depends on the degree of dependency between the two attribute sets. Thus, future research could attempt to explore how the dependency between the two sets of attributes may affect the accuracy.

2. In this paper, we only perform sensitivity analysis based on synthetic data sets. In the future, we may conduct sensitivity analysis on a mixture of real-world data sets rather than the artificially generated test data alone, or on a wider spectrum of data sets. Either way could provide a more thorough understanding of how the clustering results may be influenced by data sets with different properties.

3. The three algorithms proposed in this paper all assume that the attributes are numerical. This assumption often does not hold in real-life applications, where we may encounter other kinds of attributes such as categorical, Boolean or nominal ones. Thus, it would be worth considering how to design new two-attribute-set clustering algorithms that can cluster data with non-numerical attributes.

4. In this paper, we developed the clustering algorithms based on the density-based approach. Besides the density-based approach, other approaches have also been used to develop clustering algorithms, including distance-based, partition-based, hierarchical-based, model-based and grid-based approaches. Thus, in the future we might try to employ other approaches to solve the problem.


Acknowledgement

The research was supported in part by MOE Program for Promoting Academic Excellence of Universities under the Grant No. 91-H-FA07-1-4.

Appendix A. Synthetic data generation

We modify the data generation method developed in Liu et al. (2000) to produce the synthetic data. In this study, we use two different sets of attributes for classifying and clustering. The values of all dimensions are between 0 and 100 ([0, 100]). To create a cluster, we first pick mc classifying dimensions and nc clustering dimensions, and then randomly choose two numbers in [0, 100] for each of these dimensions. These two numbers decide the range of the dimension in which the cluster exists (different clusters might have some common dimensions and common ranges, but they do not have the same ranges in all dimensions). For the other, unpicked dimensions, we assume that the values of a cluster spread uniformly over the range [0, 100]. Having defined the ranges for every cluster, we assume that the number of records in each cluster is (1 − e) · |X|/c. Then, we generate the data for each cluster within their ranges according to either a normal distribution or a uniform distribution. Finally, we generate e · |X| noise records outside the ranges of the clusters.
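The following Python sketch is a simplified reading of this generation procedure; the parameter names follow Table A1, but the function itself is ours (for instance, the noise rows are drawn over the full [0, 100] range rather than strictly outside the cluster ranges, and the cluster dimensions are sampled from all attributes rather than mc classifying plus nc clustering ones separately).

```python
# A simplified sketch of the synthetic data generator described above.
import random

def generate(n_records, m, n, k, c, mc, nc, e=0.01, dist="U", lo=0.0, hi=100.0):
    dims = m + n - k                       # classifying + clustering, k shared
    # For each cluster, pick mc + nc dimensions and a random sub-range on each.
    clusters = []
    for _ in range(c):
        picked = random.sample(range(dims), mc + nc)
        ranges = {d: tuple(sorted(random.uniform(lo, hi) for _ in range(2)))
                  for d in picked}
        clusters.append(ranges)

    records = []
    per_cluster = int((1 - e) * n_records / c)
    for ranges in clusters:
        for _ in range(per_cluster):
            row = []
            for d in range(dims):
                a, b = ranges.get(d, (lo, hi))      # unpicked dims: full range
                if dist == "N" and d in ranges:     # normal within the range
                    x = random.gauss((a + b) / 2, (b - a) / 6)
                    x = min(max(x, a), b)
                else:                               # uniform
                    x = random.uniform(a, b)
                row.append(x)
            records.append(row)
    while len(records) < n_records:                 # remaining ~ e*|X| noise rows
        records.append([random.uniform(lo, hi) for _ in range(dims)])
    return records

data = generate(1000, m=8, n=8, k=4, c=5, mc=6, nc=6, dist="N")
print(len(data), len(data[0]))                      # 1000 12
```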

Table A1 lists all the parameters used in generating the synthetic data. There are in total 26 combinations of parameters in the experiments, as shown in Table A2. Let us use the data set X100000–m8–n8–k4–c5–mc6–nc6–disN for explanation: (1) X100000 indicates that the data set has 100,000 data records; (2) m8–n8–k4 means we have eight classifying attributes and eight clustering attributes, among which four attributes are common, i.e., these four are used both as classifying and clustering attributes; so, in total we have 8 + 8 − 4 = 12 attributes; (3) c5 means there are five clusters in total; (4) mc6–nc6–disN means the data of each cluster spread over specific ranges along six classifying attributes and six clustering attributes in accordance with the normal distribution.

Table A1
The parameters used in generating the synthetic data

|X|    The number of records
m      The number of classifying attributes
n      The number of clustering attributes
k      The number of common attributes
c      The number of clusters
mc     The number of classifying attributes in a cluster
nc     The number of clustering attributes in a cluster
e      The ratio of noise data; we fix it as 1%
dis    Data distribution
N      Normal distribution
U      Uniform distribution

Table A2
Combinations of parameters in the experiment

X100000–m8–n8–k8–c5–mc6–nc6–disN      X100000–m8–n8–k8–c5–mc6–nc6–disU
X200000–m8–n8–k8–c5–mc6–nc6–disN      X200000–m8–n8–k8–c5–mc6–nc6–disU
X300000–m8–n8–k8–c5–mc6–nc6–disN      X300000–m8–n8–k8–c5–mc6–nc6–disU
X400000–m8–n8–k8–c5–mc6–nc6–disN      X400000–m8–n8–k8–c5–mc6–nc6–disU
X500000–m8–n8–k8–c5–mc6–nc6–disN      X500000–m8–n8–k8–c5–mc6–nc6–disU
X100000–m8–n8–k8–c10–mc6–nc6–disN     X100000–m8–n8–k8–c10–mc6–nc6–disU
X100000–m8–n8–k8–c15–mc6–nc6–disN     X100000–m8–n8–k8–c15–mc6–nc6–disU
X100000–m8–n8–k8–c20–mc6–nc6–disN     X100000–m8–n8–k8–c20–mc6–nc6–disU
X100000–m8–n8–k8–c25–mc6–nc6–disN     X100000–m8–n8–k8–c25–mc6–nc6–disU
X100000–m8–n8–k6–c5–mc6–nc6–disN      X100000–m8–n8–k6–c5–mc6–nc6–disU
X100000–m8–n8–k4–c5–mc6–nc6–disN      X100000–m8–n8–k4–c5–mc6–nc6–disU
X100000–m8–n8–k2–c5–mc6–nc6–disN      X100000–m8–n8–k2–c5–mc6–nc6–disU
X100000–m8–n8–k0–c5–mc6–nc6–disN      X100000–m8–n8–k0–c5–mc6–nc6–disU

References

Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J., 1999. OPTICS: Ordering points to identify clustering structure. In: Proceedings of the ACM SIGMOD Conference, Philadelphia, PA, pp. 49–60.
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1999. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105.
Basak, J., Krishnapuram, R., 2005. Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Transactions on Knowledge and Data Engineering 17 (1), 121–132.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Chen, Y.L., Hsu, C.L., Chou, S.C., 2003. Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications 25 (2), 199–209.
Cheng, C.H., Fu, A.W., Zhang, Y., 1999. Entropy-based subspace clustering for mining numerical data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93.
Friedman, J.H., Fisher, N.I., 1999. Bump hunting in high-dimensional data. Statistics and Computing 9 (2), 123–143.
Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6 (4), 303–360.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: An efficient clustering algorithm for large databases. In: Proceedings of the ACM SIGMOD Conference, Seattle, WA, pp. 73–84.
Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Academic Press.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Computing Surveys 31 (3), 264–323.
Kantardzic, M., 2002. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press.
Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer 32, 68–75.
Keim, D., Hinneburg, A., 1999. Clustering techniques for large data sets: From the past to the future. KDD Tutorial Notes 1999, pp. 141–181.
Klawonn, F., Kruse, R., 1997. Constructing a fuzzy controller from data. Fuzzy Sets and Systems 85, 177–193.
Liu, B., Xia, Y., Yu, P., 2000. Clustering through decision tree construction. In: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pp. 20–29.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Quinlan, J.R., 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77–90.
Ruggieri, S., 2002. Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering 14 (2), 438–444.
Yao, Y.Y., 1998. A comparative study of fuzzy sets and rough sets. Journal of Information Sciences 109, 227–242.