Similarity/Clustering
Artificial Intelligence Laboratory, Moon Hong-gu
2006. 1. 17
2
Content
What is Clustering
Clustering Method
-Distance-based
 -Hierarchical
 -Flat
-Geometric embedding approach
 -Self-organizing maps
 -Multidimensional scaling
 -Latent semantic indexing
3
Formulations and Approaches
Partitioning Approaches
One possible goal that we can set up for a clustering algorithm is to partition the document collection into k subsets or clusters D1, ..., Dk so as to minimize the intracluster distance or maximize the intracluster resemblance.
Bottom-up clustering
Top-down clustering
4
Formulations and Approaches
5
Distance based
Hierarchical clustering
-The tree of hierarchical clustering can be produced:
Bottom-up (agglomerative clustering)
– start with the individual objects and group the most similar ones
– join the clusters with maximum similarity
Top-down (divisive clustering)
– start with all the objects and divide them into groups so as to maximize within-group similarity
– split the least coherent part of a cluster
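The bottom-up procedure above can be sketched in a few lines of Python; the function names and the 1-D single-link similarity used here are illustrative, not from the slides:

```python
# Bottom-up (agglomerative) clustering sketch: start with singleton
# clusters and repeatedly merge the most similar pair until k remain.
def agglomerative(points, k, similarity):
    clusters = [[p] for p in points]          # each object starts alone
    while len(clusters) > k:
        # find the pair of clusters with maximum similarity
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # join the two clusters
        del clusters[j]
    return clusters

# single-link similarity between two clusters of 1-D points
def single_link(a, b):
    return max(-abs(x - y) for x in a for y in b)

print(agglomerative([1.0, 1.1, 5.0, 5.2], 2, single_link))
```

Running top-down (divisive) clustering would instead start from one cluster containing all objects and repeatedly split the least coherent one.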
6
Three methods in hierarchical clustering
Single-link: similarity of the two most similar members
Complete-link: similarity of the two least similar members
Group average: average similarity between members
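The three criteria can be written as small similarity functions between clusters; a sketch for 1-D points using negative absolute distance as the pairwise similarity (all helper names are illustrative, not from the slides):

```python
# Pairwise similarity for 1-D points: larger (less negative) = more similar.
def sim(x, y):
    return -abs(x - y)

def single_link(a, b):      # similarity of the two MOST similar members
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b):    # similarity of the two LEAST similar members
    return min(sim(x, y) for x in a for y in b)

def group_average(a, b):    # average similarity over all cross pairs
    pairs = [sim(x, y) for x in a for y in b]
    return sum(pairs) / len(pairs)

a, b = [0.0, 1.0], [2.0, 4.0]
print(single_link(a, b), complete_link(a, b), group_average(a, b))
# → -1.0 -4.0 -2.5
```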
7
Single link Clustering
Similarity of the two most similar members => O(n²)
Locally coherent: close objects end up in the same cluster
Chaining effect: a chain of large similarities is followed without taking the global context into account => low global cluster quality
8
Complete link Clustering
Similarity of the two least similar members => O(n³)
The criterion focuses on global cluster quality
Avoids elongated clusters: a/f or b/e is tighter than a/d (tighter clusters are better than 'straggly' clusters)
9
Group average agglomerative clustering
Averages the similarity between members; the complexity of computing the average similarity is O(n²)
Average similarities are recomputed each time a new group is formed
A compromise between single-link and complete-link
10
Comparison
Single-link: relatively efficient; long, straggly, ellipsoidal clusters; loosely bound clusters
Complete-link: tightly bound clusters
Group average: intermediate between single-link and complete-link
11
Distance based
Flat clustering
-k – means
- Unlike hierarchical clustering, the k-means method is a mutually exclusive (partitioning) clustering method in which each object belongs to exactly one cluster.
The number of clusters is fixed in advance, and the method determines which cluster each object belongs to; it is widely used for clustering large amounts of data.
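A minimal k-means sketch for 1-D data, assuming the number of clusters is fixed in advance as described above (toy data and names, illustrative only):

```python
# k-means sketch for 1-D data: fix k centroids, assign each object to its
# nearest centroid, recompute centroids, and repeat.
def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        # assignment step: each point goes to exactly one cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

cents, clus = kmeans([1.0, 1.2, 4.8, 5.0], [0.0, 6.0])
print(cents)   # → [1.1, 4.9]
```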
12
Distance based
k – means
13
Geometric Embedding Approaches
Self - organizing maps
Multidimensional scaling
Latent semantic indexing
★ A different form of partition-based clustering is to identify dense regions in space.
14
Geometric Embedding Approaches
Self - organizing maps(SOMs)
- Self-organizing maps are a close cousin of k-means, except that unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
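A toy sketch of the SOM update rule under these assumptions: map units live on a 1-D grid, and each input pulls the best matching unit and its grid neighbours toward it, so related clusters end up adjacent on the grid (parameters and helper names are illustrative, not from the slides):

```python
import math, random

# SOM sketch: each input x pulls the best matching unit (BMU) AND its
# grid neighbours toward it; the neighbourhood weight decays with grid
# distance from the BMU.  (Toy 1-D data and parameters.)
def train_som(data, n_units=4, epochs=50, lr=0.3, radius=1.0):
    random.seed(0)
    weights = [random.uniform(0, 6) for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            bmu = min(range(n_units), key=lambda u: abs(x - weights[u]))
            for u in range(n_units):
                # Gaussian neighbourhood around the BMU on the grid
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                weights[u] += lr * h * (x - weights[u])
    return weights

w = train_som([1.0, 1.1, 5.0, 5.2])
print(w)
```

After training, some unit sits near each dense region of the data, and units that respond to similar inputs are neighbours on the grid.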
15
SOM : Example
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light
areas have a high density of documents.
16
Geometric Embedding Approaches
Multidimensional scaling (MDS)
-The goal of MDS is to represent documents as points in a low-dimensional space (often 2-D or 3-D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
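A minimal MDS sketch by gradient descent on the squared-stress objective, assuming the input is a matrix of pairwise dissimilarities (toy version; the function names are my own):

```python
import math, random

# MDS sketch: place documents as 2-D points so that Euclidean distances
# match the input dissimilarities, by gradient descent on the stress
# sum over pairs of (||y_i - y_j|| - d_ij)^2.
def mds(d, dims=2, iters=1000, lr=0.05):
    random.seed(1)
    n = len(d)
    y = [[random.random() for _ in range(dims)] for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [y[i][k] - y[j][k] for k in range(dims)]
                dist = math.sqrt(sum(c * c for c in diff)) or 1e-9
                g = (dist - d[i][j]) / dist   # pull/push along the segment
                for k in range(dims):
                    y[i][k] -= lr * g * diff[k]
    return y

# three documents with pairwise dissimilarities 1, 1, 2 (a collinear layout)
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
y = mds(d)
```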
17
Geometric Embedding Approaches
Latent semantic indexing (LSI)
- The latent semantic indexing (LSI) method is an attempt to solve the
synonymy problem while staying within the vector space model
framework
18
Latent semantic indexing (LSI)
- SVD of the $t \times d$ term-document matrix $A$ (rows = terms, columns = documents):
$$A = U D V^T$$
where $U$ is $t \times r$, $D$ is the $r \times r$ diagonal matrix of singular values, and $V$ is $d \times r$. Keeping only the $k$ largest singular values represents each document as a k-dim vector, so that related terms such as "car" and "auto" are mapped to nearby directions in the latent space.
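The car/auto synonymy effect can be reproduced with a tiny SVD, assuming NumPy is available (toy term-document matrix, illustrative only):

```python
import numpy as np

# LSI sketch: "car" and "auto" never co-occur in the same document, so
# d0 = {car} and d2 = {auto} have cosine 0 in the raw vector space.
# After a rank-1 SVD (k = 1), their shared co-occurrence with
# d1 = {car, auto} maps both documents onto the same latent direction.
terms = ["car", "auto"]
A = np.array([[1.0, 1.0, 0.0],     # "car"  appears in d0, d1
              [0.0, 1.0, 1.0]])    # "auto" appears in d1, d2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each document as a k-dim vector

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(A[:, 0], A[:, 2]))         # raw space: 0.0
print(cos(docs_k[0], docs_k[2]))     # latent space: ≈ 1.0 (synonyms merged)
```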
19
EM algorithm
A soft version of k-means clustering
① both clusters move towards the centroid of all three objects
② they reach the stable final state
20
EM algorithm(2)
We want to calculate the probability $P(c_j \mid \vec{x}_i)$
Assume that cluster $j$ has a normal distribution:
$$n(\vec{x}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\!\left(-\frac{1}{2}(\vec{x}-\mu_j)^T \Sigma_j^{-1} (\vec{x}-\mu_j)\right)$$
The maximum-likelihood mixture has the form:
$$P(\vec{x}_i) = \sum_{j=1}^{k} \pi_j \, n(\vec{x}_i; \mu_j, \Sigma_j)$$
21
Procedure of EM
Expectation Step (E): compute $h_{ij}$, the expectation of $z_{ij}$:
$$h_{ij} = E(z_{ij} \mid \vec{x}_i; \Theta) = \frac{P(\vec{x}_i \mid n_j; \Theta)}{\sum_{l=1}^{k} P(\vec{x}_i \mid n_l; \Theta)}$$
Maximization Step (M): re-estimate the parameters from the $h_{ij}$:
$$\mu_j = \frac{\sum_{i=1}^{n} h_{ij}\, \vec{x}_i}{\sum_{i=1}^{n} h_{ij}}, \qquad
\Sigma_j = \frac{\sum_{i=1}^{n} h_{ij}\, (\vec{x}_i-\mu_j)(\vec{x}_i-\mu_j)^T}{\sum_{i=1}^{n} h_{ij}}, \qquad
\pi_j = \frac{1}{n}\sum_{i=1}^{n} h_{ij}$$