
Data Mining: Concepts and Techniques (3rd ed.)

Chapter 10

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

What is Cluster Analysis?

- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications

- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City-planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding earth climate, find patterns of atmosphere and ocean
- Economic science: market research

Clustering as a Preprocessing Tool (Utility)

- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing: vector quantization
- Finding K-nearest neighbors: localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Quality: What Is Good Clustering?

- A good clustering method will produce high quality clusters
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically metric: d(i, j)
- The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables

Considerations for Cluster Analysis

- Partitioning criteria: single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges

- Scalability: clustering all the data instead of only on samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixture of these
- Constraint-based clustering
  - User may give inputs on constraints
  - Use domain knowledge to determine input parameters
- Interpretability and usability
- Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality

Major Clustering Approaches (I)

- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: Diana, Agnes, BIRCH, CAMELEON
- Density-based approach: based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach: based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)

- Model-based: a model is hypothesized for each of the clusters and tries to find the best fit of that model to each other
  - Typical methods: EM, SOM, COBWEB

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Partitioning Algorithms: Basic Concepts

- Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized, where c_i is the centroid or medoid of cluster C_i:

  $E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2$

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
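A minimal NumPy sketch of these four steps (not the book's code; the random seeding, empty-cluster handling, and stopping rule here are simplifying assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd-style k-means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrary initial seeds, here k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the centroids (hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```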

An Example of K-Means Clustering

- K = 2
- Arbitrarily partition objects into k groups
- Repeat
  - Compute the centroid (i.e., mean point) for each partition
  - Assign each object to the cluster of its nearest centroid
- Until no change

Comments on the K-Means Method

- Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
  - Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
- Comment: often terminates at a local optimum

Variations of the K-Means Method

- Most of the variants of the k-means differ in
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers
  - Since an object with an extremely large value may substantially distort the distribution of the data
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster

[Figure: two 10x10 scatter plots contrasting the mean and the medoid as cluster representatives]

PAM: A Typical K-Medoids Algorithm

[Figure: K = 2 example on a 10x10 grid of points, Total Cost = 20]

- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Randomly select a non-medoid object, O_random
- Compute the total cost of swapping (Total Cost = 26)
- Swap O and O_random if the quality is improved
- Do loop until no change

The K-Medoid Clustering Method

- K-Medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
- Efficiency improvement on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Hierarchical Clustering

- Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: objects a, b, c, d, e merged step by step (Step 0 to Step 4) in the agglomerative direction (AGNES), and split in the reverse direction (Step 4 to Step 0, divisive, DIANA)]

AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., Splus
- Use the single-link method and the dissimilarity matrix
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster

[Figure: three stages of single-link merging on a 2-D point set]
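AGNES itself ships with S-plus; as a rough stand-in, SciPy's hierarchical clustering performs the same single-link merging (the toy data and the cut level below are arbitrary choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                 # toy 2-D data
Z = linkage(X, method='single')           # AGNES-style single-link merging
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
```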

Dendrogram: Shows How Clusters are Merged

- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster

DIANA (Divisive Analysis)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

[Figure: three stages of divisive splitting on a 2-D point set]

Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq)
- Average: avg distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j)
- Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j)
  - Medoid: a chosen, centrally located object in the cluster
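The first four measures translate directly into code; a small NumPy sketch, assuming Euclidean distance and clusters given as (m, d) arrays:

```python
import numpy as np

def pairwise(Ki, Kj):
    """All |Ki| x |Kj| Euclidean distances between two clusters."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()

def centroid_dist(Ki, Kj):
    # distance between the two cluster centroids (mean points)
    return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
```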

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster:

  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

- Radius: square root of average distance from any point of the cluster to its centroid:

  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

- Diameter: square root of average mean squared distance between all pairs of points in the cluster:

  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
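A direct NumPy translation of the three formulas (a sketch; the cluster is assumed to be an (N, d) array):

```python
import numpy as np

def cluster_stats(X):
    """Centroid, radius, and diameter of a cluster X of shape (N, d)."""
    N = len(X)
    centroid = X.sum(axis=0) / N
    # radius: sqrt of the average squared distance from points to the centroid
    radius = np.sqrt(((X - centroid) ** 2).sum() / N)
    # diameter: sqrt of the average squared distance over all ordered pairs
    # (the i = j terms contribute zero, matching the double sum above)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum()
    diameter = np.sqrt(sq / (N * (N - 1)))
    return centroid, radius, diameter
```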

Extensions to Hierarchical Clustering

- Major weakness of agglomerative clustering methods
  - Can never undo what was done previously
  - Do not scale well: time complexity of at least O(n²), where n is the number of total objects
- Integration of hierarchical & distance-based clustering
  - BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

- Zhang, Ramakrishnan & Livny, SIGMOD'96
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and sensitive to the order of the data records

Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

- N: number of data points
- LS: linear sum of N points: $LS = \sum_{i=1}^{N} X_i$
- SS: square sum of N points: $SS = \sum_{i=1}^{N} X_i^2$

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
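A small sketch of the CF tuple, its additivity (the CF of two merged subclusters is just the component-wise sum), and the radius recovered from the CF alone:

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of points: (N, LS, SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CF additivity: merging two subclusters adds the components."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)        # 5 [16. 30.] [ 54. 190.]

# the radius follows from the CF alone: R^2 = SS/N - (LS/N)^2, summed over dims
R = np.sqrt(SS.sum() / N - ((LS / N) ** 2).sum())
```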

CF-Tree in BIRCH

- Clustering feature
  - Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in a tree has descendants or "children"
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: max # of children
  - Threshold: max diameter of sub-clusters stored at the leaf nodes

The CF Tree Structure

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and nonleaf nodes hold entries CF_1, CF_2, ... with child pointers, and the leaf nodes are chained by prev/next pointers]

The Birch Algorithm

- Cluster diameter:

  $D = \sqrt{\frac{1}{n(n-1)} \sum_i \sum_j (x_i - x_j)^2}$

- For each point in the input
  - Find the closest leaf entry
  - Add the point to the leaf entry and update the CF
  - If entry diameter > max_diameter, then split the leaf, and possibly the parents
- Algorithm is O(n)
- Concerns
  - Sensitive to insertion order of data points
  - Since we fix the size of leaf nodes, clusters may not be so natural
  - Clusters tend to be spherical given the radius and diameter measures

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

- CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
- Measures the similarity based on a dynamic model
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters
- Graph-based, and a two-phase algorithm
  1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON

[Figure: pipeline — Data Set → Construct (K-NN) Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]

- K-NN graph: p and q are connected if q is among the top-k closest neighbors of p

[…]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

DBSCAN: The Algorithm

- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
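For experimentation, scikit-learn ships a DBSCAN driven by the same two parameters (eps plays the role of Eps, min_samples of MinPts); the data and parameter values here are arbitrary stand-ins:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
# label -1 marks noise points that belong to no density-connected cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```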

DBSCAN: Sensitive to Parameters

[Figure lost in transcription]

OPTICS: A Cluster-Ordering Method (1999)

- OPTICS: Ordering Points To Identify the Clustering Structure
  - Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  - Produces a special order of the database wrt its density-based clustering structure
  - This cluster-ordering contains info equivalent to the density-based clusterings corresponding to a broad range of parameter settings
  - Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
  - Can be represented graphically or using visualization techniques

    @"TC: o*e &8tension ro*

  • 5/17/2018 10ClusBasic

    48/101

    @"TC: o*e &8tension ro*D!CAB

    Inde.-based*

    1 L n)mber of dimensions

    7 L $%

    p L BZ

    M L 7&-p L Comple.ity* 5%log%

    Core :istance* min eps s't' point is core

    >eachability :istance

    p2

    *i+t = 5

    = 3 c.

    *a! (c%re'ditace (%), d (%, p))

    r(p1, %) = 28c. r(p2,%) = 4c.

    %

    %

    p1

    -4

[Figure: OPTICS reachability plot — reachability-distance (undefined at cluster starts) vs. the cluster order of the objects]

Density-Based Clustering: OPTICS & Its Applications

[Figure lost in transcription]

DENCLUE: Using Statistical Density Functions

- DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
- Using statistical density functions:
  - influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
  - total influence on x: $f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
  - gradient of x in the direction of x_i: $\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
- Major features
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (e.g., DBSCAN)
  - But needs a large number of parameters
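The two Gaussian functions translate directly into code; a sketch of the density and its gradient (hill-climbing along the gradient is what leads points to their density attractors):

```python
import numpy as np

def gaussian_density(x, X, sigma=1.0):
    """Total Gaussian influence of all data points X (shape (N, d)) on x."""
    d2 = ((X - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gaussian_gradient(x, X, sigma=1.0):
    """Gradient of the overall density at x."""
    d2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((X - x) * w[:, None]).sum(axis=0)
```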

Denclue: Technical Essence

- Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- Overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors
  - Density attractors are local maxima of the overall density function
- Center-defined clusters: assign to each density attractor the points density-attracted to it
- Arbitrary shaped clusters: merge density attractors that are connected through paths of high density (> threshold)

Density Attractor

[Figure lost in transcription]

Center-Defined and Arbitrary Clusters

[Figure lost in transcription]

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Grid-Based Clustering Method

- Using multi-resolution grid data structure
- Several interesting methods
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)

STING: A Statistical Information Grid Approach

    58/101

    The TB% Clustering Method

    3ach cell at a high level is partitioned into a n)mber ofsmaller cells in the ne.t lower level

    !tatistical info of each cell is calc)lated and storedbeforehand and is )sed to answer 4)eries

    Parameters of higher level cells can be easily calc)latedfrom parameters of lower level cell count, mean, s, min, max type of distrib)tion\normal, uniform, etc'

    Use a top-down approach to answer spatial data 4)eries !tart from a pre-selected layer\typically with a small

    n)mber of cells "or each cell in the c)rrent level comp)te the confidence

    interval/4

STING Algorithm and Its Analysis

- Remove the irrelevant cells from further consideration
- Repeat this process until the bottom layer is reached
- Advantages
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages
  - All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
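A sketch of the bottom-up parameter propagation: the cell record below is hypothetical, but count, mean, min, and max aggregate from children exactly as stated above (standard deviation would combine similarly from per-cell sums of squares):

```python
from dataclasses import dataclass

@dataclass
class Cell:            # hypothetical per-cell statistics record
    count: int
    mean: float
    min: float
    max: float

def parent_cell(children):
    """Statistics of a higher-level cell derived from its lower-level cells."""
    n = sum(c.count for c in children)
    mean = sum(c.mean * c.count for c in children) / n if n else 0.0
    return Cell(count=n,
                mean=mean,
                min=min(c.min for c in children),
                max=max(c.max for c in children))
```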

CLIQUE (Clustering In QUEst)

- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifying subspaces of a high dimensional data space that allow better clustering than the original space
- CLIQUE can be considered as both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  - A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate minimal description for the clusters
  - Determine maximal regions that cover a cluster of connected dense units for each cluster
  - Determination of minimal cover for each cluster
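A sketch of the first step only (finding dense 1-D intervals; the grid resolution xi and density threshold tau are assumed parameters); higher-dimensional candidate units would then be joined Apriori-style from these:

```python
import numpy as np
from collections import Counter

def dense_units_1d(X, xi=10, tau=0.05):
    """1-D pass of CLIQUE: dense intervals per dimension."""
    n, d = X.shape
    # partition every dimension into xi equal-length intervals
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    bins = ((X - X.min(axis=0)) / span * xi).astype(int).clip(0, xi - 1)
    counts = Counter()
    for dim in range(d):
        for b in bins[:, dim]:
            counts[(dim, b)] += 1
    # a unit is dense if its fraction of points exceeds the threshold tau
    return {u for u, c in counts.items() if c / n > tau}
```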

[Figure: CLIQUE example — salary (10,000) and vacation (week) plotted against age (20–60); dense units in the (age, salary) and (age, vacation) planes, found with density threshold τ = 3, are intersected to locate a candidate cluster in (age, salary, vacation) space]

Strength and Weakness of CLIQUE

- Strength
  - Automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces
  - Insensitive to the order of records in input and does not presume some canonical data distribution
  - Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
- Weakness
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Assessing Clustering Tendency

- Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
- Test spatial randomness by statistical test: Hopkins Statistic
  - Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
  - Sample n points, p_1, ..., p_n, uniformly from D. For each p_i, find its nearest neighbor in D: x_i = min{dist(p_i, v)} where v in D
  - Sample n points, q_1, ..., q_n, uniformly from D. For each q_i, find its nearest neighbor in D − {q_i}: y_i = min{dist(q_i, v)} where v in D and v ≠ q_i
  - Calculate the Hopkins Statistic:

    $H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$

  - If D is uniformly distributed, Σx_i and Σy_i will be close to each other and H is close to 0.5. If D is highly skewed, H is close to 0
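A sketch of the statistic with SciPy's KD-tree (the sample size n and the use of the bounding box as the sampling space are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """Hopkins statistic of dataset D (shape (m, d)); H near 0.5 => uniform."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)
    # x_i: distance from a uniform random location in the data space to D
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    x, _ = tree.query(p, k=1)
    # y_i: distance from a sampled data point to its nearest OTHER point in D
    q = D[rng.choice(len(D), n, replace=False)]
    y, _ = tree.query(q, k=2)     # k=2 so the point itself (distance 0) is skipped
    y = y[:, 1]
    return y.sum() / (x.sum() + y.sum())
```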

Determine the Number of Clusters

- Empirical method
  - # of clusters ≈ √(n/2) for a dataset of n points
- Elbow method
  - Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
- Cross-validation method
  - Divide a given data set into m parts
  - Use m − 1 parts to obtain a clustering model
  - Use the remaining part to test the quality of the clustering
    - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and the closest centroids to measure how well the model fits the test set
  - For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data the best
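A sketch of the elbow curve with scikit-learn (KMeans.inertia_ is exactly the within-cluster sum of squared distances; the data is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)
# compute the SSE for k = 1..10 and look for the "elbow" (turning point)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]
for k, e in enumerate(sse, start=1):
    print(k, round(e, 2))
```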

Measuring Clustering Quality

- Two methods: extrinsic vs. intrinsic
- Extrinsic: supervised, i.e., the ground truth is available
  - Compare a clustering against the ground truth using certain clustering quality measures
  - Ex. BCubed precision and recall metrics
- Intrinsic: unsupervised, i.e., the ground truth is unavailable
  - Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
  - Ex. Silhouette coefficient
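A sketch of the intrinsic route using scikit-learn's silhouette coefficient (the data and k are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# mean silhouette over all points: near 1 = compact and well separated,
# near 0 = overlapping clusters, negative = many likely misassigned points
print(silhouette_score(X, labels))
```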

Measuring Clustering Quality: Extrinsic Methods

- Clustering quality measure: Q(C, C_g), for a clustering C given the ground truth C_g
- Q is good if it satisfies the following four essential criteria
  - Cluster homogeneity: the purer, the better
  - Cluster completeness: should assign objects belonging to the same category in the ground truth to the same cluster
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Measure of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
- Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
- DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
- STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
- Quality of clustering results can be evaluated in various ways

Introduction

- Coverage
  - Cluster Analysis: Chapter 11
  - Outlier Detection: Chapter 12
  - Mining Sequence Data: BK2: Chapter 8
  - Mining Graphs Data: BK2: Chapter …
  - Social and Information Network Analysis: BK2: Chapter …
    - Partial coverage: Mark Newman, "Networks: An Introduction", Oxford U., 2010
    - Scattered coverage: Easley and Kleinberg, "Networks, Crowds, and Markets: Reasoning About a Highly Connected World"

References (1)

- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99
- Beil F., Ester M., Xu X.: Frequent Term-Based Text Clustering. KDD'02
- M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98
- V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99

References (2)

- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98
- S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98
- S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March 1999
- A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
- G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98

References (3)

- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988
- A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases. ICDT'01
- A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. ICDE'01
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96
- X. Yin, J. Han, and P. S. Yu. "LinkClus: Efficient Clustering via Heterogeneous Semantic Links". VLDB'06

Slides unused in class

A Typical K-Medoids Algorithm (PAM)

[Figure: K = 2 example on a 10x10 grid of points, Total Cost = 20]

- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Randomly select a non-medoid object, O_random
- Compute the total cost of swapping (Total Cost = 26)
- Swap O and O_random if the quality is improved
- Do loop until no change

    "AM ("artitioning Around Medoids)(164)

  • 5/17/2018 10ClusBasic

    77/101

    (164)

    P(M Ka)fman and >o)ssee)w, &DB, b)ilt in !pl)s Use real ob+ect to represent the cl)ster

    !elect krepresentative ob+ects arbitrarily

    "or each pair of non-selected ob+ect hand selectedob+ect i, calc)late the total swapping cost TCih "or each pair of iand h,

    If :"ihF %, iis replaced by h

    /hen assign each non-selected ob+ect to the most

    similar representative ob+ect

    repeat steps $- )ntil there is no change
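A naive sketch of the swap loop (exhaustive over all (i, h) pairs, so roughly O(k(n−k)²) per pass; the initialization and tie-breaking are arbitrary choices, not PAM's exact bookkeeping):

```python
import numpy as np

def pam(X, k, seed=0):
    """Greedy PAM-style k-medoids: swap while any swap lowers the total cost."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise
    medoids = list(rng.choice(len(X), k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                c = D[:, trial].min(axis=1).sum()
                if c < cost:               # i.e., TC_ih < 0: replace i by h
                    medoids, cost, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)  # assign to most similar medoid
    return medoids, labels
```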

    "AM Clustering: inding the !est ClusterCenter

  • 5/17/2018 10ClusBasic

    78/101

    4

    Center

    Case &* p c)rrently belongs to o+' If o+is replaced by orandomas arepresentative ob+ect and p is the closest to one of the otherrepresentative ob+ect oi, then p is reassigned to oi

What Is the Problem with PAM?

- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well for large data sets
  - O(k(n-k)²) for each iteration, where n is # of data, k is # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications) (1990)

- CLARA (Kaufmann and Rousseeuw in 1990)
- Built in statistical analysis packages, such as SPlus
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM

CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- Draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If the local optimum is found, it starts with a new randomly selected node in search for a new local optimum
- Advantages: more efficient and scalable than both PAM and CLARA
- Further improvement: focusing techniques and spatial access structures (Ester et al.'95)

ROCK: Clustering Categorical Data

- ROCK: RObust Clustering using linKs
  - S. Guha, R. Rastogi & K. Shim, ICDE'99
- Major ideas
  - Use links to measure similarity/proximity
  - Not distance-based
- Algorithm: sampling-based clustering
  - Draw random sample
  - Cluster with links
  - Label data in disk
- Experiments
  - Congressional voting, mushroom data

Similarity Measure in ROCK

- Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
- Example: two groups (clusters) of transactions
  - C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Jaccard coefficient may lead to a wrong clustering result
  - C1: 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
  - C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
- Jaccard coefficient-based similarity function:

  $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

- Ex. let T_1 = {a, b, c}, T_2 = {c, d, e}: Sim(T_1, T_2) = 1/5 = 0.2
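The similarity function is a one-liner in Python; this reproduces the worked example:

```python
def jaccard(t1, t2):
    """Jaccard coefficient of two transactions given as iterables of items."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({'a', 'b', 'c'}, {'c', 'd', 'e'}))   # 1/5 = 0.2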

[…]

- Repeatedly find groups of tightly related nodes, which are merged into a higher-level node
- Tightness of a group of nodes
- For a group of nodes …

Initializing a Tree with Frequent Pattern Mining

- Finding tight groups: frequent pattern mining
- Procedure of initializing a tree
  - Start from leaf nodes (level-0)
  - At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

[Figure: nodes reduced to higher-level groups g1 and g2]

Adjusting SimTree Structures

- After similarity changes, the tree structure also needs to be changed
  - If a node is more similar to its parent's sibling, then move it to be a child of that sibling
  - Try to move each node to its parent's sibling that it is most similar to, under the constraint that each parent node can have at most c children

[Figure: a node n7 with similarity 0.8 to its current parent and 0.9 to that parent's sibling is moved under the more similar parent]

Complexity

                              Time            Space
Updating similarities         O(M(logN)²)     O(M + N)
Adjusting tree structures     O(N)            O(N)
LinkClus                      O(M(logN)²)     O(M + N)
SimRank                       O(M²)           O(N²)

For two types of objects, N in each, and M linkages between them.

Experiment: Email Dataset

- F. Nielsen. Email dataset. www.imm.dtu.dk/~rem/data/Email-1431.zip
- 370 emails on conferences, 272 on jobs, and 789 spam emails
- Accuracy: measured by manually labeled data
- Accuracy of clustering: % of pairs of objects in the same cluster that share a common label
- Approaches compared
  - SimRank (Jeh & Widom)
  - SimRank with FingerPrints (F-SimRank): Fogaras & Rácz
  - ReCom (Wang et al.)

Approach      Accuracy    time (s)
LinkClus      0.8026      1579.6
SimRank       0.7965      39160
ReCom         0.5711      74.6
F-SimRank     0.3688      479.7
CLARANS       0.4768      8.55

WaveCluster: Clustering by Wavelet Analysis (1998)

- Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
- A multi-resolution clustering approach which applies wavelet transform to the feature space; both grid-based and density-based

The WaveCluster Algorithm

- How to apply the wavelet transform to find clusters
  - Summarizes the data by imposing a multidimensional grid structure onto the data space
  - These multidimensional spatial data objects are represented in an n-dimensional feature space
  - Apply wavelet transform on the feature space to find the dense regions in the feature space
  - Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
- Major features
  - Complexity O(N)
  - Detect arbitrary shaped clusters at different scales
  - Not sensitive to noise, not sensitive to input order
  - Only applicable to low dimensional data

Quantization & Transformation

- Quantize data into m-D grid structure, then wavelet transform
  - a) scale 1: high resolution
  - b) scale 2: medium resolution
  - c) scale 3: low resolution