
112/04/21

Efficiently Clustering Transactional Data with Weighted Coverage Density

M. Hua Yan, Keke Chen, and Ling Liu

Proceedings of the 15th International Conference on Information and Knowledge Management, ACM CIKM, 2006

Presenter: 吳建良


Outline
  Motivation
  SCALE Framework
  BKPlot Method
  WCD Clustering Algorithm
  Cluster Validity Evaluation
  Experimental Results


Motivation
  Transactional data is a special kind of categorical data
    e.g., t1={milk, bread, beer}, t2={milk, bread}
    Can be transformed into a row-by-column table with Boolean values
    Large volume and high dimensionality make the existing algorithms inefficient at processing the transformed data
  Existing algorithms for clustering transactional data: LargeItem, CLOPE, CCCD
    Require users to manually tune at least one or two parameters
    Suitable settings for these parameters differ from dataset to dataset
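The Boolean-table transformation mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper; the function name `to_boolean_table` and the alphabetical column order are assumptions made for the example.

```python
def to_boolean_table(transactions):
    """Turn a list of item sets into (column names, Boolean rows)."""
    # One column per distinct item, sorted for a stable column order.
    items = sorted(set().union(*transactions))
    # A cell is True when the transaction contains the column's item.
    rows = [[item in t for item in items] for t in transactions]
    return items, rows


items, rows = to_boolean_table([{"milk", "bread", "beer"}, {"milk", "bread"}])
for row in rows:
    print(dict(zip(items, row)))
```

With many distinct items, most cells in such a table are False, which is the high-dimensionality problem the slide refers to.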

SCALE Framework
  ACE & BkPlot (SSDBM’05)
    ACE: Agglomerative Categorical clustering with Entropy criterion
    BkPlot:
      Examines the entropy difference between the clustering structures with varying K
      Reports the Ks where the clustering structure changes dramatically
  Evaluation Metrics
    LISR: Large Item Size Ratio
    AMI: Average pair-clusters Merging Index


ACE Algorithm
  Bottom-up process
    Initially, each record is a cluster
    Iteratively, find the most similar pair of clusters Cp and Cq, and then merge them
  Incremental entropy
    $$I_m(C_p, C_q) = (n_p + n_q)\,\hat{H}(C_p \cup C_q) - \left(n_p \hat{H}(C_p) + n_q \hat{H}(C_q)\right)$$
  The most similar pair of clusters
    $I_m(C_p, C_q)$ is minimum among all possible pairs
  $I_m^{(K)}$ denotes the $I_m$ value in forming the K-cluster partition from the (K+1)-cluster partition
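As a rough illustration of the incremental entropy $I_m(C_p, C_q) = (n_p + n_q)\hat{H}(C_p \cup C_q) - (n_p \hat{H}(C_p) + n_q \hat{H}(C_q))$, the sketch below computes it for two small clusters. The slides do not spell out the estimator $\hat{H}$; here it is assumed to be the sum of per-column empirical Shannon entropies (in bits), and both function names are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(records):
    """Assumed estimator of H-hat: sum of per-column empirical entropies."""
    n = len(records)
    total = 0.0
    for column in zip(*records):  # iterate over attributes
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def incremental_entropy(cp, cq):
    """I_m(Cp, Cq): entropy added by merging clusters cp and cq."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - (len(cp) * cluster_entropy(cp) + len(cq) * cluster_entropy(cq)))


same = incremental_entropy([("a", "x")], [("a", "x")])
different = incremental_entropy([("a",)], [("b",)])
print(same, different)
```

Merging identical records adds no entropy, while merging records that disagree on a column does, so the pair with the smallest value is the most similar one.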

BkPlot
  Increasing rate of entropy:
    $$I(K) = \frac{1}{N d}\, I_m^{(K)}$$
    N: total number of records, d: number of columns
  Small increasing rate
    Merging does not introduce any impurity to the clusters
    Clustering structure is not significantly changed
  Large increasing rate
    Introduces considerable impurity into the partitions
    Clustering structure can be changed significantly

BkPlot (contd.)
  Relative changes
    Use relative changes to determine whether a globally significant clustering structure emerges
    I(K)≈I(K+1), but I(K-1)>I(K)

BkPlot (contd.)
  Entropy Characteristic Graph (ECG)
    $$\delta I(K) = I(K) - I(K+1)$$
  Second-order differential of ECG:
    $$\delta^2 I(K) = \delta I(K-1) - \delta I(K)$$
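A small sketch of how candidate Ks could be read off these quantities. It assumes the definitions $I(K) = I_m^{(K)} / (N d)$, $\delta I(K) = I(K) - I(K+1)$ and $\delta^2 I(K) = \delta I(K-1) - \delta I(K)$; the function name `bkplot` and the merge costs in the example are made up for illustration.

```python
def bkplot(merge_cost, n_records, n_columns):
    """Map each K to delta^2 I(K), wherever both neighbours exist.

    merge_cost[K] is I_m^(K), the incremental entropy of forming the
    K-cluster partition from the (K+1)-cluster partition.
    """
    rate = {k: v / (n_records * n_columns) for k, v in merge_cost.items()}
    delta = {k: rate[k] - rate[k + 1] for k in rate if k + 1 in rate}
    return {k: delta[k - 1] - delta[k] for k in delta if k - 1 in delta}


# Hypothetical merge costs: cheap merges down to K=3, expensive below it.
costs = {1: 40.0, 2: 30.0, 3: 4.0, 4: 3.0, 5: 2.0}
curve = bkplot(costs, n_records=10, n_columns=1)
best_k = max(curve, key=curve.get)
print(curve, best_k)
```

In this toy input, I(3) is close to I(4) while I(2) is much larger than I(3), so the peak of the second-order differential falls at K=3.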

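WCD Clustering Algorithm
  Notations
    D: transactional dataset
    N: size of the dataset
    $I = \{I_1, I_2, \dots, I_m\}$: a set of items
    $t_j = \{I_{j1}, I_{j2}, \dots, I_{jl}\}$: a transaction
    A transaction clustering result $C^K = \{C_1, C_2, \dots, C_K\}$ is a partition of D, where
    $$C_1 \cup \dots \cup C_K = D, \quad C_i \neq \emptyset, \quad C_i \cap C_j = \emptyset \ (i \neq j)$$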

Intra-cluster Similarity Measure
  Coverage Density (CD)
    Given a cluster $C_k$
      $M_k$: number of distinct items
      $I_k = \{I_{k1}, I_{k2}, \dots, I_{kM_k}\}$: item set of $C_k$
      $N_k$: number of transactions in $C_k$
      $S_k$: sum of occurrences of all items in $C_k$
    $$CD(C_k) = \frac{\text{filled cells}}{\text{rectangle area}} = \frac{S_k}{N_k \times M_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})}{N_k \times M_k}$$
    CD ↑, compactness ↑
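The coverage density $CD(C_k) = S_k / (N_k \times M_k)$ can be computed directly from the transactions of a cluster. The sketch below is illustrative only; the function name and the sample cluster are assumptions, and exact fractions are used so the result is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction


def coverage_density(cluster):
    """CD = filled cells / (transactions x distinct items)."""
    occur = Counter(item for t in cluster for item in t)
    s_k = sum(occur.values())   # S_k: filled cells
    n_k = len(cluster)          # N_k: number of transactions
    m_k = len(occur)            # M_k: number of distinct items
    return Fraction(s_k, n_k * m_k)


# Three transactions over items a, b, c with five filled cells.
cd_value = coverage_density([{"a", "b", "c"}, {"a"}, {"a"}])
print(cd_value)
```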

Intra-cluster Similarity Measure (contd.)
  Drawback of CD
    Insufficient to measure the density of frequent itemsets
    Each item has an equal contribution in a cluster: $W_j = \frac{1}{M_k}$
    Two clusters may have the same CD but different filled-cell distributions
    [Figure: two clusters over items a, b, c with different filled-cell distributions; CD = 5/9]

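Intra-cluster Similarity Measure (contd.)
  Weighted Coverage Density (WCD)
    Focus on high-frequency items
    Define $W_j$ as
    $$W_j = \frac{\mathrm{occur}(I_{kj})}{S_k}, \quad \text{s.t.} \ \sum_{j=1}^{M_k} W_j = 1$$
    $$WCD(C_k) = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj}) \times W_j}{N_k} = \frac{1}{N_k} \sum_{j=1}^{M_k} \mathrm{occur}(I_{kj}) \cdot \frac{\mathrm{occur}(I_{kj})}{S_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k \times N_k}$$
    [Figure: two clusters over items a, b, c compared under CD and WCD; labels 1/3, 3/6, 2/6, 1/6]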
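To see why weighting by item frequency helps, the sketch below compares two hypothetical clusters that have the same CD (5/9) but different filled-cell distributions. It assumes $CD(C_k) = S_k / (N_k M_k)$ and $WCD(C_k) = \sum_j \mathrm{occur}(I_{kj})^2 / (S_k N_k)$; the clusters and function names are made up for the example and are not the ones drawn on the slide.

```python
from collections import Counter
from fractions import Fraction


def item_occurrences(cluster):
    return Counter(item for t in cluster for item in t)


def cd(cluster):
    occur = item_occurrences(cluster)
    return Fraction(sum(occur.values()), len(cluster) * len(occur))


def wcd(cluster):
    occur = item_occurrences(cluster)
    s_k = sum(occur.values())
    return Fraction(sum(c * c for c in occur.values()), s_k * len(cluster))


skewed = [{"a", "b", "c"}, {"a"}, {"a"}]   # occurrences: a=3, b=1, c=1
even = [{"a", "b"}, {"a", "b"}, {"c"}]     # occurrences: a=2, b=2, c=1

print(cd(skewed), cd(even))    # same coverage density
print(wcd(skewed), wcd(even))  # WCD favours the cluster with a dominant item
```

CD cannot tell the two clusters apart, whereas WCD is higher for the cluster in which one item occurs in every transaction.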

Clustering Criterion
  Expected Weighted Coverage Density (EWCD)
    $$EWCD(C^K) = \sum_{k=1}^{K} \frac{N_k}{N}\, WCD(C_k) = \frac{1}{N} \sum_{k=1}^{K} \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k}$$
  The clustering algorithm tries to maximize the EWCD
  When every individual transaction is considered as a cluster, the EWCD reaches its maximum, EWCD=1
  Use the BKPlot method to generate a set of candidate “best Ks”
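The criterion can be checked numerically. The sketch below assumes $EWCD(C^K) = \frac{1}{N} \sum_k \sum_j \mathrm{occur}(I_{kj})^2 / S_k$ and confirms that putting every transaction in its own cluster gives EWCD = 1, which is why K has to be fixed beforehand. The function name `ewcd` and the sample data are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def ewcd(clusters):
    """Expected weighted coverage density of a partition (list of clusters)."""
    n_total = sum(len(cluster) for cluster in clusters)
    total = Fraction(0)
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in t)
        s_k = sum(occur.values())
        total += Fraction(sum(c * c for c in occur.values()), s_k)
    return total / n_total


data = [{"a", "b"}, {"a"}, {"x", "y", "z"}]
singletons = ewcd([[t] for t in data])      # every transaction alone
merged = ewcd([[data[0], data[1]], [data[2]]])
print(singletons, merged)
```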

WCD Clustering Algorithm

Input: Dataset D, Number of clusters K, Initial K seeds
Output: K clusters

/* Phase 1 – Initialization */
K seeds form the initial K clusters;
while not end of D do
    read one transaction t from D;
    add t into Ci that maximizes EWCD;
    write <t, i> back to D;

/* Phase 2 – Iteration */
while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while has not checked all transactions do
        read <t, i>;
        if moving t to cluster Cj increases EWCD and i ≠ j then
            moveMark = true;
            write <t, j> back to D;
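The two phases can be sketched in memory as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the dataset is a Python list rather than a file on disk, the score is recomputed from scratch for every candidate move instead of being updated incrementally, N × EWCD is used as the objective (N is constant, so the ranking of moves is the same), and the names `score`, `score_with` and `wcd_clustering` are hypothetical.

```python
import random
from collections import Counter


def score(clusters):
    """N * EWCD: for each cluster, sum_j occur(I_kj)^2 / S_k."""
    total = 0.0
    for members in clusters:
        occur = Counter(item for t in members for item in t)
        if occur:  # an emptied cluster contributes nothing
            total += sum(c * c for c in occur.values()) / sum(occur.values())
    return total


def score_with(clusters, k, t):
    """Score of the partition if transaction t were added to cluster k."""
    trial = [m + [t] if idx == k else m for idx, m in enumerate(clusters)]
    return score(trial)


def wcd_clustering(transactions, seed_ids, max_passes=50, rng=None):
    rng = rng or random.Random(0)
    k_range = range(len(seed_ids))
    clusters = [[transactions[s]] for s in seed_ids]
    label = {s: k for k, s in enumerate(seed_ids)}

    # Phase 1: assign every non-seed transaction to the best cluster.
    for idx, t in enumerate(transactions):
        if idx not in label:
            k = max(k_range, key=lambda c: score_with(clusters, c, t))
            clusters[k].append(t)
            label[idx] = k

    # Phase 2: keep moving transactions while some move raises the score.
    for _ in range(max_passes):
        moved = False
        order = list(range(len(transactions)))
        rng.shuffle(order)  # random access sequence
        for idx in order:
            t, i = transactions[idx], label[idx]
            clusters[i].remove(t)
            scores = [score_with(clusters, c, t) for c in k_range]
            j = max(k_range, key=scores.__getitem__)
            if scores[j] <= scores[i] + 1e-12:
                j = i  # no strict improvement: stay in the current cluster
            clusters[j].append(t)
            if j != i:
                label[idx], moved = j, True
        if not moved:
            break
    return [label[idx] for idx in range(len(transactions))]


data = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"},
        {"x", "y"}, {"x", "y", "z"}, {"x", "y"}]
labels = wcd_clustering(data, seed_ids=[0, 3])
print(labels)
```

The `max_passes` cap is an added safeguard; the pseudocode itself simply stops when a full pass makes no move.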

Cluster Validity Evaluation
  LISR (Large Item Size Ratio)
    Measures the preservation of frequent itemsets
    $$LISR = \sum_{k=1}^{K} \frac{N_k}{N} \times \frac{LS_k}{S_k}$$
    where $LS_k$ is the number of large items in $C_k$
    LISR ↑ → high co-occurrence of items → high possibility of finding more frequent itemsets at the user-specified minimum support
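A possible way to compute $LISR = \sum_k \frac{N_k}{N} \times \frac{LS_k}{S_k}$ is sketched below. The slide only says that $LS_k$ is the "#Large Items" in $C_k$; this sketch assumes one reading of that, namely the total occurrences of items whose support inside the cluster reaches a minimum support threshold. That reading, the function name `lisr`, and the sample data are all assumptions.

```python
from collections import Counter


def lisr(clusters, min_support):
    """Large Item Size Ratio under the assumed definition of LS_k."""
    n_total = sum(len(cluster) for cluster in clusters)
    value = 0.0
    for cluster in clusters:  # clusters are assumed to be non-empty
        occur = Counter(item for t in cluster for item in t)
        s_k = sum(occur.values())
        # Assumed LS_k: occurrences of items with in-cluster support >= threshold.
        ls_k = sum(c for c in occur.values() if c / len(cluster) >= min_support)
        value += len(cluster) / n_total * ls_k / s_k
    return value


# a appears in 3/3 transactions, b in 2/3, c in 1/3.
ratio = lisr([[{"a", "b"}, {"a", "b"}, {"a", "c"}]], min_support=0.6)
print(ratio)
```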

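Cluster Validity Evaluation (contd.)
  Inter-cluster dissimilarity between $C_i$ and $C_j$
    $$d(C_i, C_j) = \frac{N_i}{N_i + N_j}\, CD(C_i) + \frac{N_j}{N_i + N_j}\, CD(C_j) - CD(C_i \cup C_j)$$
  Simplify:
    $$\begin{aligned}
    d(C_i, C_j) &= \frac{N_i}{N_i + N_j} \cdot \frac{S_i}{N_i M_i} + \frac{N_j}{N_i + N_j} \cdot \frac{S_j}{N_j M_j} - \frac{S_i + S_j}{(N_i + N_j)\, M_{ij}} \\
    &= \frac{1}{N_i + N_j} \left( \frac{S_i}{M_i} + \frac{S_j}{M_j} - \frac{S_i + S_j}{M_{ij}} \right) \\
    &= \frac{1}{N_i + N_j} \left( S_i \left( \frac{1}{M_i} - \frac{1}{M_{ij}} \right) + S_j \left( \frac{1}{M_j} - \frac{1}{M_{ij}} \right) \right)
    \end{aligned}$$
    where $M_{ij}$ is the number of distinct items after merging the two clusters, thus $M_{ij} \ge \max\{M_i, M_j\}$
  Because $\frac{1}{M_i} \ge \frac{1}{M_{ij}}$ and $\frac{1}{M_j} \ge \frac{1}{M_{ij}}$, $d(C_i, C_j)$ is a real number between 0 and 1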


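Example
  If $M_i = M_j = M_{ij}$, then $d(C_i, C_j) = 0$
    [Figure: $C_i$ and $C_j$ over the same items a, b, c]
  $M_i = M_j = 3$, $M_{ij} = 5$
    [Figure: $C_i$ over items a, b, c; $C_j$ over items c, d, e]
    $$d(C_i, C_j) = \frac{1}{6} \left( 5 \left( \frac{1}{3} - \frac{1}{5} \right) + 4 \left( \frac{1}{3} - \frac{1}{5} \right) \right) = \frac{1}{5}$$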
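The example can be checked with a short script. It computes the dissimilarity both from the definition, $d(C_i, C_j) = \frac{N_i}{N_i+N_j} CD(C_i) + \frac{N_j}{N_i+N_j} CD(C_j) - CD(C_i \cup C_j)$, and from the simplified form in terms of $S$, $M$ and $M_{ij}$. The concrete transactions are hypothetical; they are chosen only to match the example's counts ($N_i + N_j = 6$, $S_i = 5$, $S_j = 4$, $M_i = M_j = 3$, $M_{ij} = 5$).

```python
from collections import Counter
from fractions import Fraction


def stats(cluster):
    """Return (N, M, S): transactions, distinct items, filled cells."""
    occur = Counter(item for t in cluster for item in t)
    return len(cluster), len(occur), sum(occur.values())


def cd(cluster):
    n, m, s = stats(cluster)
    return Fraction(s, n * m)


def dissimilarity(ci, cj):
    """d(Ci, Cj) from the definition based on coverage density."""
    ni, nj = len(ci), len(cj)
    return (Fraction(ni, ni + nj) * cd(ci)
            + Fraction(nj, ni + nj) * cd(cj)
            - cd(ci + cj))


def dissimilarity_simplified(ci, cj):
    """d(Ci, Cj) from the simplified closed form."""
    ni, mi, si = stats(ci)
    nj, mj, sj = stats(cj)
    mij = stats(ci + cj)[1]
    return Fraction(1, ni + nj) * (si * (Fraction(1, mi) - Fraction(1, mij))
                                   + sj * (Fraction(1, mj) - Fraction(1, mij)))


ci = [{"a", "b", "c"}, {"a"}, {"b"}]   # S_i = 5, M_i = 3
cj = [{"c", "d"}, {"e"}, {"c"}]        # S_j = 4, M_j = 3
d_def = dissimilarity(ci, cj)
d_simple = dissimilarity_simplified(ci, cj)
d_same_items = dissimilarity([{"a", "b", "c"}], [{"a", "b", "c"}, {"a"}])
print(d_def, d_simple, d_same_items)
```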


AMI (Average pair-clusters Merging Index)
  Evaluates the overall inter-cluster dissimilarity of a clustering result having K clusters
    $$AMI = \frac{1}{K} \sum_{i=1}^{K} D_i, \quad D_i = \min\{\, d(C_i, C_j),\ j = 1, \dots, K,\ j \neq i \,\}$$
  AMI ↑ → better clustering quality
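A sketch of the AMI computation, assuming $AMI = \frac{1}{K} \sum_i D_i$ with $D_i$ the smallest dissimilarity between $C_i$ and any other cluster, and using the simplified form of $d(C_i, C_j)$. The three sample clusters and the function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def stats(cluster):
    """Return (N, M, S): transactions, distinct items, filled cells."""
    occur = Counter(item for t in cluster for item in t)
    return len(cluster), len(occur), sum(occur.values())


def dissimilarity(ci, cj):
    ni, mi, si = stats(ci)
    nj, mj, sj = stats(cj)
    mij = stats(ci + cj)[1]  # distinct items after merging
    return Fraction(1, ni + nj) * (si * (Fraction(1, mi) - Fraction(1, mij))
                                   + sj * (Fraction(1, mj) - Fraction(1, mij)))


def ami(clusters):
    """Average over clusters of the distance to the nearest other cluster."""
    k = len(clusters)
    nearest = [min(dissimilarity(clusters[i], clusters[j])
                   for j in range(k) if j != i)
               for i in range(k)]
    return sum(nearest) / k


clusters = [
    [{"a", "b"}, {"a", "b"}],
    [{"a", "b"}, {"a"}],       # same items as the first cluster
    [{"x", "y"}],
]
index = ami(clusters)
print(index)
```

Because the first two clusters share the same items, their nearest-neighbour distance is 0, which pulls the index down and signals that they could be merged.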

Experiments
  Datasets
    Tc30a6r1000: 1000 records, 30 columns, 6 possible attribute values
    Zoo: 101 records, 18 attributes
    Mushroom: 8124 instances, 22 attributes
    Mushroom100k: sample of the Mushroom data with duplicates, 100,000 instances
    TxI4Dx: IBM Data Generator

Experimental Results
  Tc30a6r1000
    The repulsion parameter r of CLOPE controls the number of clusters
    [Figure panels: 5 clusters; 9 clusters]

Experimental Results (contd.)

  Zoo: K=7 is the best
    [Figure panels: 2 clusters; 4 clusters; 7 clusters]


Mushroom: K=19 is the best



Performance evaluation on mushroom100k

[Figure panels: r=0.5~4.0; r=2.0]


Performance evaluation on TxI4Dx

[Figure panels: T10I4Dx; TxI4D100k]
