Efficiently Clustering Transactional Data with Weighted Coverage Density

M. Hua Yan, Keke Chen, and Ling Liu. Proceedings of the 15th International Conference on Information and Knowledge Management, ACM CIKM, 2006. Presenter: 吳建良


Page 1: Efficiently Clustering Transactional data with Weighted Coverage Density

Efficiently Clustering Transactional Data with Weighted Coverage Density

M. Hua Yan, Keke Chen, and Ling Liu

Proceedings of the 15th International Conference on Information and Knowledge Management, ACM CIKM, 2006

Presenter: 吳建良

Page 2: Efficiently Clustering Transactional data with Weighted Coverage Density

Outline
  Motivation
  SCALE Framework
  BKPlot Method
  WCD Clustering Algorithm
  Cluster Validity Evaluation
  Experimental Results

Page 3: Efficiently Clustering Transactional data with Weighted Coverage Density

Motivation
  Transactional data is a special kind of categorical data
    Example: t1 = {milk, bread, beer}, t2 = {milk, bread}
    It can be transformed into a row-by-column table of Boolean values
    Large volume and high dimensionality make existing algorithms inefficient at processing the transformed data
  Existing algorithms for clustering transactional data: LargeItem, CLOPE, CCCD
    They require users to manually tune at least one or two parameters
    Suitable settings for these parameters differ from dataset to dataset
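The transformation of transactions into a Boolean table can be sketched in a few lines. This is only an illustration: the helper name `to_boolean_table` and the alphabetical column order are assumptions, not something taken from the slides.

```python
def to_boolean_table(transactions):
    """Turn a list of item sets into (column names, Boolean rows)."""
    # One column per distinct item, in alphabetical order (an arbitrary choice).
    items = sorted(set().union(*transactions))
    rows = [[item in t for item in items] for t in transactions]
    return items, rows


t1 = {"milk", "bread", "beer"}
t2 = {"milk", "bread"}
items, rows = to_boolean_table([t1, t2])
print(items)  # ['beer', 'bread', 'milk']
print(rows)   # [[True, True, True], [False, True, True]]
```

With thousands of distinct items the table becomes very wide and mostly False, which is the dimensionality problem the slide refers to.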

Page 4: Efficiently Clustering Transactional data with Weighted Coverage Density

SCALE Framework
  ACE & BkPlot (SSDBM'05)
    ACE: Agglomerative Categorical clustering with Entropy criterion
    BkPlot:
      Examines the entropy difference between the clustering structures with varying K
      Reports the Ks where the clustering structure changes dramatically
  Evaluation Metrics
    LISR: Large Item Size Ratio
    AMI: Average pair-clusters Merging Index

Page 5: Efficiently Clustering Transactional data with Weighted Coverage Density

ACE Algorithm
  Bottom-up process
    Initially, each record is a cluster
    Iteratively, find the most similar pair of clusters $C_p$ and $C_q$, and then merge them
  Incremental entropy
    $$I_m(C_p, C_q) = (n_p + n_q)\,\hat{H}(C_p \cup C_q) - \bigl(n_p \hat{H}(C_p) + n_q \hat{H}(C_q)\bigr)$$
  The most similar pair of clusters
    $I_m(C_p, C_q)$ is minimum among all possible pairs
  $I_m^{(K)}$ denotes the $I_m$ value in forming the K-cluster partition from the (K+1)-cluster partition
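A minimal sketch of the incremental entropy of a merge, under the assumption that the cluster entropy is the sum of the per-column entropies (in bits) of the records in the cluster. The slide does not spell out the entropy estimator, so `cluster_entropy` is a hypothetical stand-in.

```python
import math
from collections import Counter


def cluster_entropy(records):
    """Assumed estimator: sum of per-column entropies (bits) within a cluster."""
    n = len(records)
    total = 0.0
    for column in zip(*records):
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def incremental_entropy(cp, cq):
    """I_m(Cp, Cq) = (np + nq) H(Cp u Cq) - (np H(Cp) + nq H(Cq))."""
    merged = cp + cq
    return len(merged) * cluster_entropy(merged) - (
        len(cp) * cluster_entropy(cp) + len(cq) * cluster_entropy(cq)
    )


# Merging identical records adds no impurity; merging different ones does.
same = incremental_entropy([("a", "x")], [("a", "x")])
diff = incremental_entropy([("a", "x")], [("b", "y")])
print(same, diff)
```

An agglomerative step would evaluate this quantity for every pair of clusters and merge the pair with the smallest value.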

Page 6: Efficiently Clustering Transactional data with Weighted Coverage Density

BkPlot
  Increasing rate of entropy:
    $$I(K) = \frac{1}{N d}\, I_m^{(K)}$$
    N: total number of records, d: number of columns
  Small increasing rate
    Merging does not introduce any impurity to the clusters
    Clustering structure is not significantly changed
  Large increasing rate
    Merging introduces considerable impurity into the partitions
    Clustering structure can be changed significantly

Page 7: Efficiently Clustering Transactional data with Weighted Coverage Density

BkPlot (contd.)

Relative changes
  Use relative changes to determine whether a globally significant clustering structure emerges
  I(K) ≈ I(K+1), but I(K-1) > I(K)

Page 8: Efficiently Clustering Transactional data with Weighted Coverage Density

BkPlot (contd.)

Entropy Characteristic Graph (ECG)
  $$\delta I(K) = I(K) - I(K+1)$$
Second-order differential of ECG:
  $$\delta^2 I(K) = \delta I(K-1) - \delta I(K)$$
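The second-order difference can be computed directly from a table of I(K) values. The sketch below assumes the differences are defined as I(K) - I(K+1) and its first difference taken between K-1 and K; the values in `rate` are invented for illustration only.

```python
def second_order_differences(rate):
    """rate maps K -> I(K); returns K -> second-order difference where defined."""
    delta = {k: rate[k] - rate[k + 1] for k in rate if k + 1 in rate}
    return {k: delta[k - 1] - delta[k] for k in delta if k - 1 in delta}


# Made-up increasing rates: I(3) is close to I(4), but I(2) is much larger than I(3).
rate = {1: 0.50, 2: 0.30, 3: 0.05, 4: 0.04, 5: 0.03}
d2 = second_order_differences(rate)
best_k = max(d2, key=d2.get)
print(best_k)  # 3
```

Peaks of the second-order difference mark the Ks at which merging one step further would change the clustering structure sharply, so they serve as candidate "best Ks".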

Page 9: Efficiently Clustering Transactional data with Weighted Coverage Density

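WCD Clustering Algorithm
  Notations
    D: transactional dataset
    N: size of the dataset
    $I = \{I_1, I_2, \dots, I_m\}$: a set of items
    $t_j = \{I_{j1}, I_{j2}, \dots, I_{jl}\}$: a transaction
  A transaction clustering result $C^K = \{C_1, C_2, \dots, C_K\}$ is a partition of D, where
    $$C_1 \cup \dots \cup C_K = D, \quad C_i \neq \emptyset, \quad C_i \cap C_j = \emptyset \ (i \neq j)$$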

Page 10: Efficiently Clustering Transactional data with Weighted Coverage Density

Intra-cluster Similarity Measure
  Coverage Density (CD)
    Given a cluster $C_k$
      $M_k$: number of distinct items
      $I_k = \{I_{k1}, I_{k2}, \dots, I_{kM_k}\}$: item set of $C_k$
      $N_k$: number of transactions in $C_k$
      $S_k$: sum of occurrences of all items in $C_k$
    CD = filled cells / rectangle area
    $$CD(C_k) = \frac{S_k}{N_k \times M_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})}{N_k \times M_k}$$
    CD ↑, compactness ↑
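As a small illustration, coverage density can be computed by counting item occurrences in a cluster stored as a list of item sets. The function name and the sample cluster are assumptions made for this sketch.

```python
from collections import Counter


def coverage_density(cluster):
    """CD = S_k / (N_k * M_k) for a cluster given as a list of item sets."""
    occur = Counter(item for t in cluster for item in t)
    n_k = len(cluster)           # number of transactions
    m_k = len(occur)             # number of distinct items
    s_k = sum(occur.values())    # filled cells
    return s_k / (n_k * m_k)


cluster = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
cd_value = coverage_density(cluster)  # 6 filled cells in a 3 x 3 rectangle
print(cd_value)
```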

Page 11: Efficiently Clustering Transactional data with Weighted Coverage Density

Intra-cluster Similarity Measure (contd.)

Drawback of CD
  Insufficient to measure the density of frequent itemsets
  Each item has an equal contribution in a cluster: $W_j = \frac{1}{M_k}$
  Two clusters may have the same CD but different filled-cell distributions
  [Figure: two clusters over items a, b, c, both with $CD = \frac{5}{9}$]

Page 12: Efficiently Clustering Transactional data with Weighted Coverage Density

Intra-cluster Similarity Measure (contd.)

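Weighted Coverage Density (WCD)
  Focus on high-frequency items
  Define $W_j$ as
    $$W_j = \frac{\mathrm{occur}(I_{kj})}{S_k}, \quad \text{s.t. } \sum_{j=1}^{M_k} W_j = 1$$
  $$WCD(C_k) = \frac{1}{N_k} \sum_{j=1}^{M_k} \mathrm{occur}(I_{kj}) \times W_j = \frac{1}{N_k} \sum_{j=1}^{M_k} \mathrm{occur}(I_{kj}) \times \frac{\mathrm{occur}(I_{kj})}{S_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k \times N_k}$$
  [Figure: two example clusters over items a, b, c, compared by CD and WCD]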
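To see how WCD separates clusters that CD cannot, the sketch below evaluates both measures on two hand-made clusters with the same number of filled cells. The clusters `skewed` and `even` are hypothetical examples, and exact fractions are used so the values can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction


def occurrences(cluster):
    return Counter(item for t in cluster for item in t)


def cd(cluster):
    """CD = S_k / (N_k * M_k)."""
    occur = occurrences(cluster)
    return Fraction(sum(occur.values()), len(cluster) * len(occur))


def wcd(cluster):
    """WCD = sum_j occur(I_kj)^2 / (S_k * N_k)."""
    occur = occurrences(cluster)
    s_k = sum(occur.values())
    return Fraction(sum(c * c for c in occur.values()), s_k * len(cluster))


skewed = [{"a", "b", "c"}, {"a", "b"}, {"a"}]   # occurrences: a=3, b=2, c=1
even = [{"a", "b"}, {"b", "c"}, {"a", "c"}]     # occurrences: a=2, b=2, c=2
print(cd(skewed), cd(even))    # 2/3 2/3
print(wcd(skewed), wcd(even))  # 7/9 2/3
```

Both clusters fill 6 of 9 cells, but the cluster whose occurrences concentrate on a frequent item gets the higher WCD.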

Page 13: Efficiently Clustering Transactional data with Weighted Coverage Density

Clustering Criterion
  Expected Weighted Coverage Density (EWCD)
    $$EWCD(C^K) = \sum_{k=1}^{K} \frac{N_k}{N} \times WCD(C_k) = \frac{1}{N} \sum_{k=1}^{K} \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k}$$
  The clustering algorithm tries to maximize the EWCD
  When every individual transaction is treated as its own cluster, EWCD reaches its maximum value of 1, so K cannot be chosen by maximizing EWCD alone
  Use the BKPlot method to generate a set of candidate "best Ks"
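The claim that singleton clusters reach EWCD = 1 is easy to check numerically. The sketch below assumes each transaction is a non-empty set of distinct items; the sample transactions are invented.

```python
from collections import Counter
from fractions import Fraction


def ewcd(clusters):
    """EWCD = (1/N) * sum_k (sum_j occur(I_kj)^2 / S_k)."""
    n = sum(len(c) for c in clusters)
    total = Fraction(0)
    for c in clusters:
        occur = Counter(item for t in c for item in t)
        total += Fraction(sum(v * v for v in occur.values()), sum(occur.values()))
    return total / n


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a"}, {"d", "e"}]
singletons = [[t] for t in transactions]
print(ewcd(singletons))      # 1
print(ewcd([transactions]))  # 1/2
```

In a singleton cluster every item occurs once, so the inner sum equals the transaction length and each cluster contributes exactly 1.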

Page 14: Efficiently Clustering Transactional data with Weighted Coverage Density

WCD Clustering Algorithm

Input: Dataset D, Number of clusters K, Initial K seeds
Output: K clusters

/* Phase 1 - Initialization */
K seeds form the initial K clusters;
while not end of D do
    read one transaction t from D;
    add t into Ci that maximizes EWCD;
    write <t, i> back to D;

/* Phase 2 - Iteration */
while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while has not checked all transactions do
        read <t, i>;
        if moving t to cluster Cj increases EWCD and i ≠ j
            moveMark = true;
            write <t, j> back to D;
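The two phases can be sketched in memory as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: seeds are passed as indices into the data (an assumption), labels are kept in a dictionary instead of being written back to D, and the score is recomputed from scratch on every trial, which is simple but slow.

```python
import random
from collections import Counter
from fractions import Fraction


def score(clusters):
    """N * EWCD: sum over clusters of (sum_j occur_j^2) / S_k."""
    total = Fraction(0)
    for members in clusters:
        occur = Counter(item for t in members for item in t)
        if occur:
            total += Fraction(sum(v * v for v in occur.values()),
                              sum(occur.values()))
    return total


def score_if_added(t, i, clusters):
    """Score of the clustering if transaction t were added to cluster i."""
    return score([c + [t] if j == i else c for j, c in enumerate(clusters)])


def wcd_clustering(data, seed_ids, rng_seed=0):
    rng = random.Random(rng_seed)
    clusters = [[data[s]] for s in seed_ids]
    labels = {s: k for k, s in enumerate(seed_ids)}
    ks = range(len(clusters))
    # Phase 1: put each remaining transaction where the score is highest.
    for idx, t in enumerate(data):
        if idx not in labels:
            i = max(ks, key=lambda k: score_if_added(t, k, clusters))
            clusters[i].append(t)
            labels[idx] = i
    # Phase 2: keep moving transactions while a move strictly improves the score.
    moved = True
    while moved:
        moved = False
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            t, i = data[idx], labels[idx]
            clusters[i].remove(t)
            j = max(ks, key=lambda k: score_if_added(t, k, clusters))
            if j != i and score_if_added(t, j, clusters) > score_if_added(t, i, clusters):
                i, moved = j, True
            clusters[i].append(t)
            labels[idx] = i
    return [labels[idx] for idx in range(len(data))]


data = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"},
        {"x", "y"}, {"x", "y", "z"}, {"y", "z"}]
print(wcd_clustering(data, seed_ids=[0, 3]))  # [0, 0, 0, 1, 1, 1]
```

Because a transaction only moves when the score strictly increases, the second phase terminates. A practical version would update per-cluster item counts incrementally rather than rescoring every cluster.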

Page 15: Efficiently Clustering Transactional data with Weighted Coverage Density

Cluster Validity Evaluation
  LISR (Large Item Size Ratio)
    Measures the preservation of frequent itemsets
    $$LISR = \sum_{k=1}^{K} \frac{N_k}{N} \times \frac{LS_k}{S_k}$$
    where $LS_k$ is the number of large items in $C_k$
    LISR ↑: high co-occurrence of items, and therefore a high possibility of finding more frequent itemsets at the user-specified minimum support
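A hedged sketch of LISR follows. The slide does not define a large item precisely, so two assumptions are made here: an item is large in a cluster when the fraction of that cluster's transactions containing it is at least `min_support`, and the large-item count is measured in occurrences so that it is comparable with the total occurrence count in the denominator.

```python
from collections import Counter
from fractions import Fraction


def lisr(clusters, min_support):
    """LISR = sum_k (N_k / N) * (LS_k / S_k), under the assumptions stated above."""
    n = sum(len(c) for c in clusters)
    total = Fraction(0)
    for c in clusters:
        occur = Counter(item for t in c for item in t)
        s_k = sum(occur.values())
        # Assumed definition: occurrences of items whose in-cluster support is large.
        ls_k = sum(v for v in occur.values() if Fraction(v, len(c)) >= min_support)
        total += Fraction(len(c), n) * Fraction(ls_k, s_k)
    return total


clusters = [
    [{"a", "b", "c"}, {"a", "b"}, {"a"}],   # a and b are large at support 1/2
    [{"x", "y"}, {"x", "y"}],               # x and y are large
]
print(lisr(clusters, Fraction(1, 2)))  # 9/10
```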

Page 16: Efficiently Clustering Transactional data with Weighted Coverage Density

Cluster Validity Evaluation (contd.)

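Inter-cluster dissimilarity between $C_i$ and $C_j$
  $$d(C_i, C_j) = \frac{N_i}{N_i + N_j} CD(C_i) + \frac{N_j}{N_i + N_j} CD(C_j) - CD(C_i \cup C_j)$$
  Simplify:
  $$d(C_i, C_j) = \frac{N_i}{N_i + N_j} \cdot \frac{S_i}{N_i M_i} + \frac{N_j}{N_i + N_j} \cdot \frac{S_j}{N_j M_j} - \frac{S_i + S_j}{(N_i + N_j) M_{ij}}$$
  $$= \frac{1}{N_i + N_j} \left( \frac{S_i}{M_i} + \frac{S_j}{M_j} - \frac{S_i + S_j}{M_{ij}} \right)$$
  $$= \frac{1}{N_i + N_j} \left( S_i \left( \frac{1}{M_i} - \frac{1}{M_{ij}} \right) + S_j \left( \frac{1}{M_j} - \frac{1}{M_{ij}} \right) \right)$$
  where $M_{ij}$ is the number of distinct items after merging the two clusters, thus $M_{ij} \ge \max\{M_i, M_j\}$
  Because $\frac{1}{M_i} - \frac{1}{M_{ij}} \ge 0$ and $\frac{1}{M_j} - \frac{1}{M_{ij}} \ge 0$, $d(C_i, C_j)$ is a real number between 0 and 1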

Page 17: Efficiently Clustering Transactional data with Weighted Coverage Density

Cluster Validity Evaluation (contd.)

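Example
  If $M_i = M_j = M_{ij}$, then $d(C_i, C_j) = 0$
    [Figure: $C_i$ and $C_j$ both over items a, b, c]
  $M_i = M_j = 3$, $M_{ij} = 5$
    [Figure: $C_i$ over items a, b, c; $C_j$ over items c, d, e]
    $$d(C_i, C_j) = \frac{1}{6} \left( 5 \left( \frac{1}{3} - \frac{1}{5} \right) + 4 \left( \frac{1}{3} - \frac{1}{5} \right) \right) = \frac{1}{5}$$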
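The dissimilarity can be checked numerically as the size-weighted CD of the two clusters minus the CD of their union. The transactions below are hypothetical; they were chosen so that each cluster has three distinct items, the union has five, and the occurrence totals are 5 and 4 over six transactions.

```python
from collections import Counter
from fractions import Fraction


def cd(cluster):
    """Coverage density S / (N * M) of a list of item sets."""
    occur = Counter(item for t in cluster for item in t)
    return Fraction(sum(occur.values()), len(cluster) * len(occur))


def dissimilarity(ci, cj):
    """d(Ci, Cj): weighted CD of the parts minus CD of the merged cluster."""
    ni, nj = len(ci), len(cj)
    return (Fraction(ni, ni + nj) * cd(ci)
            + Fraction(nj, ni + nj) * cd(cj)
            - cd(ci + cj))


ci = [{"a", "b"}, {"b", "c"}, {"c"}]   # items a, b, c; 5 occurrences
cj = [{"c", "d"}, {"e"}, {"d"}]        # items c, d, e; 4 occurrences
print(dissimilarity(ci, cj))  # 1/5
print(dissimilarity(ci, ci))  # 0
```

Merging a cluster with a copy of itself adds no new items, so the dissimilarity is zero, matching the first case of the example.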

Page 18: Efficiently Clustering Transactional data with Weighted Coverage Density

Cluster Validity Evaluation (contd.)

AMI (Average pair-clusters Merging Index)
  Evaluates the overall inter-cluster dissimilarity of a clustering result having K clusters
  $$AMI = \frac{1}{K} \sum_{i=1}^{K} D_i, \quad D_i = \min\{ d(C_i, C_j) : j = 1, \dots, K, \ j \neq i \}$$
  AMI ↑, better clustering quality
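AMI averages, over all clusters, the dissimilarity to the nearest other cluster. The sketch below repeats the coverage-density-based dissimilarity so that it is self-contained; the three sample clusters are invented for illustration.

```python
from collections import Counter
from fractions import Fraction


def cd(cluster):
    occur = Counter(item for t in cluster for item in t)
    return Fraction(sum(occur.values()), len(cluster) * len(occur))


def dissimilarity(ci, cj):
    ni, nj = len(ci), len(cj)
    return (Fraction(ni, ni + nj) * cd(ci)
            + Fraction(nj, ni + nj) * cd(cj)
            - cd(ci + cj))


def ami(clusters):
    """AMI = (1/K) * sum_i min_{j != i} d(C_i, C_j); needs at least two clusters."""
    nearest = [
        min(dissimilarity(ci, cj) for j, cj in enumerate(clusters) if j != i)
        for i, ci in enumerate(clusters)
    ]
    return sum(nearest) / len(clusters)


clusters = [
    [{"a", "b"}, {"b", "c"}, {"c"}],
    [{"c", "d"}, {"e"}, {"d"}],
    [{"x"}, {"x", "y"}],
]
print(ami(clusters))  # 103/450
```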

Page 19: Efficiently Clustering Transactional data with Weighted Coverage Density

Experiments
  Datasets
    Tc30a6r1000: 1000 records, 30 columns, 6 possible attribute values
    Zoo: 101 records, 18 attributes
    Mushroom: 8124 instances, 22 attributes
    Mushroom100k: sample of the mushroom data with duplicates, 100,000 instances
    TxI4Dx: IBM Data Generator

Page 20: Efficiently Clustering Transactional data with Weighted Coverage Density

Experimental Results
  Tc30a6r1000
    The repulsion parameter r of CLOPE controls the number of clusters
    [Figure panels: 5 clusters, 9 clusters]

Page 21: Efficiently Clustering Transactional data with Weighted Coverage Density

Experimental Results (contd.)

Zoo: K=7 is the best

[Figure panels: 2 clusters, 4 clusters, 7 clusters]

Page 22: Efficiently Clustering Transactional data with Weighted Coverage Density

Experimental Results (contd.)

Mushroom: K=19 is the best


Page 23: Efficiently Clustering Transactional data with Weighted Coverage Density

Experimental Results (contd.)

Performance evaluation on mushroom100k

[Figure panels: r = 0.5~4.0, r = 2.0]

Page 24: Efficiently Clustering Transactional data with Weighted Coverage Density

Experimental Results (contd.)

Performance evaluation on TxI4Dx

[Figure panels: T10I4Dx, TxI4D100k]