Similarity/Clustering
Artificial Intelligence Laboratory, Moon Hong-gu
2006. 1. 17
2
Content
What is Clustering
Clustering Method
-Distance-based
 -Hierarchical
 -Flat
-Geometric embedding approach
 -Self-organizing maps
 -Multidimensional scaling
 -Latent semantic indexing
3
Formulations and Approaches
Partitioning Approaches
One possible goal that we can set up for a clustering algorithm is to partition the document collection into k subsets or clusters D1, ..., Dk so as to minimize the intracluster distance or maximize the intracluster resemblance.
Bottom-up clustering
Top-down clustering
4
Formulations and Approaches
5
Distance based
Hierarchical clustering
-The tree of hierarchical clustering can be produced:
Bottom-up (agglomerative clustering)
– start with the individual objects and group the most similar ones
– join the clusters with maximum similarity
Top-down (divisive clustering)
– start with all the objects and divide them into groups so as to maximize within-group similarity
– split the least coherent part of a cluster
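The bottom-up procedure above can be sketched in a few lines of Python; the function names and the 1-D single-link similarity used here are illustrative, not from the slides:

```python
# Bottom-up (agglomerative) clustering sketch: start with singleton
# clusters and repeatedly merge the most similar pair until k remain.
def agglomerative(points, k, similarity):
    clusters = [[p] for p in points]          # each object starts alone
    while len(clusters) > k:
        # find the pair of clusters with maximum similarity
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # join the two clusters
        del clusters[j]
    return clusters

# single-link similarity between two clusters of 1-D points
def single_link(a, b):
    return max(-abs(x - y) for x in a for y in b)

print(agglomerative([1.0, 1.1, 5.0, 5.2], 2, single_link))
```

Running top-down (divisive) clustering would instead start from one cluster containing all objects and repeatedly split the least coherent one.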
6
Three methods in hierarchical clustering
Single-link: similarity of the two most similar members
Complete-link: similarity of the two least similar members
Group average: average similarity between members
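The three criteria can be written as small similarity functions between clusters; a sketch for 1-D points using negative absolute distance as the pairwise similarity (all helper names are illustrative, not from the slides):

```python
# Pairwise similarity for 1-D points: larger (less negative) = more similar.
def sim(x, y):
    return -abs(x - y)

def single_link(a, b):      # similarity of the two MOST similar members
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b):    # similarity of the two LEAST similar members
    return min(sim(x, y) for x in a for y in b)

def group_average(a, b):    # average similarity over all cross pairs
    pairs = [sim(x, y) for x in a for y in b]
    return sum(pairs) / len(pairs)

a, b = [0.0, 1.0], [2.0, 4.0]
print(single_link(a, b), complete_link(a, b), group_average(a, b))
# → -1.0 -4.0 -2.5
```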
7
Single link Clustering
Similarity of the two most similar members => O(n²)
Locally coherent: close objects end up in the same cluster
Chaining effect: a chain of large similarities is followed without taking the global context into account => low global cluster quality
8
Complete link Clustering
Similarity of the two least similar members => O(n³)
The criterion focuses on global cluster quality
Avoids elongated clusters: a/f or b/e is tighter than a/d (tighter clusters are better than 'straggly' clusters)
9
Group average agglomerative clustering
Averages the similarity between members; the complexity of computing the average similarity is O(n²)
Average similarities are recomputed each time a new group is formed
A compromise between single-link and complete-link
10
Comparison
Single-link: relatively efficient; long, straggly, ellipsoidal clusters; loosely bound clusters
Complete-link: tightly bound clusters
Group average: intermediate between single-link and complete-link
11
Distance based
Flat clustering
-k – means
- Unlike hierarchical clustering, the k-means method is a mutually exclusive (partitioning) clustering method in which each object belongs to exactly one cluster.
The number of clusters is fixed in advance, and the method determines which cluster each object belongs to; it is widely used for clustering large amounts of data.
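A minimal k-means sketch for 1-D data, assuming the number of clusters is fixed in advance as described above (toy data and names, illustrative only):

```python
# k-means sketch for 1-D data: fix k centroids, assign each object to its
# nearest centroid, recompute centroids, and repeat.
def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        # assignment step: each point goes to exactly one cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

cents, clus = kmeans([1.0, 1.2, 4.8, 5.0], [0.0, 6.0])
print(cents)   # → [1.1, 4.9]
```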
12
Distance based
k – means
13
Geometric Embedding Approaches
Self - organizing maps
Multidimensional scaling
Latent semantic indexing
★ A different form of partition-based clustering is to identify dense regions in space.
14
Geometric Embedding Approaches
Self - organizing maps(SOMs)
- Self-organizing maps are a close cousin of k-means, except that unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
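A toy sketch of the SOM update rule under these assumptions: map units live on a 1-D grid, and each input pulls the best matching unit and its grid neighbours toward it, so related clusters end up adjacent on the grid (parameters and helper names are illustrative, not from the slides):

```python
import math, random

# SOM sketch: each input x pulls the best matching unit (BMU) AND its
# grid neighbours toward it; the neighbourhood weight decays with grid
# distance from the BMU.  (Toy 1-D data and parameters.)
def train_som(data, n_units=4, epochs=50, lr=0.3, radius=1.0):
    random.seed(0)
    weights = [random.uniform(0, 6) for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            bmu = min(range(n_units), key=lambda u: abs(x - weights[u]))
            for u in range(n_units):
                # Gaussian neighbourhood around the BMU on the grid
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                weights[u] += lr * h * (x - weights[u])
    return weights

w = train_som([1.0, 1.1, 5.0, 5.2])
print(w)
```

After training, some unit sits near each dense region of the data, and units that respond to similar inputs are neighbours on the grid.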
15
SOM : Example
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light
areas have a high density of documents.
16
Geometric Embedding Approaches
Multidimensional scaling (MDS)
-The goal of MDS is to represent documents as points in a low-dimensional space (often 2-D or 3-D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
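A minimal MDS sketch by gradient descent on the squared-stress objective, assuming the input is a matrix of pairwise dissimilarities (toy version; the function names are my own):

```python
import math, random

# MDS sketch: place documents as 2-D points so that Euclidean distances
# match the input dissimilarities, by gradient descent on the stress
# sum over pairs of (||y_i - y_j|| - d_ij)^2.
def mds(d, dims=2, iters=1000, lr=0.05):
    random.seed(1)
    n = len(d)
    y = [[random.random() for _ in range(dims)] for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [y[i][k] - y[j][k] for k in range(dims)]
                dist = math.sqrt(sum(c * c for c in diff)) or 1e-9
                g = (dist - d[i][j]) / dist   # pull/push along the segment
                for k in range(dims):
                    y[i][k] -= lr * g * diff[k]
    return y

# three documents with pairwise dissimilarities 1, 1, 2 (a collinear layout)
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
y = mds(d)
```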
17
Geometric Embedding Approaches
Latent semantic indexing (LSI)
- The latent semantic indexing (LSI) method is an attempt to solve the
synonymy problem while staying within the vector space model
framework
18
Latent semantic indexing (LSI)
- SVD of the $t \times d$ term-document matrix $A$ (rows = terms, columns = documents):
$$A = U D V^T$$
where $U$ is $t \times r$, $D$ is the $r \times r$ diagonal matrix of singular values, and $V$ is $d \times r$. Keeping only the $k$ largest singular values represents each document as a k-dim vector, so that related terms such as "car" and "auto" are mapped to nearby directions in the latent space.
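The car/auto synonymy effect can be reproduced with a tiny SVD, assuming NumPy is available (toy term-document matrix, illustrative only):

```python
import numpy as np

# LSI sketch: "car" and "auto" never co-occur in the same document, so
# d0 = {car} and d2 = {auto} have cosine 0 in the raw vector space.
# After a rank-1 SVD (k = 1), their shared co-occurrence with
# d1 = {car, auto} maps both documents onto the same latent direction.
terms = ["car", "auto"]
A = np.array([[1.0, 1.0, 0.0],     # "car"  appears in d0, d1
              [0.0, 1.0, 1.0]])    # "auto" appears in d1, d2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each document as a k-dim vector

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(A[:, 0], A[:, 2]))         # raw space: 0.0
print(cos(docs_k[0], docs_k[2]))     # latent space: ≈ 1.0 (synonyms merged)
```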
19
EM algorithm
A soft version of k-means clustering
① both clusters move towards the centroid of all three objects
② they reach the stable final state
20
EM algorithm(2)
We want to calculate the probability $P(c_j \mid \vec{x}_i)$
Assume that cluster $j$ has a normal distribution:
$$n(\vec{x}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\!\left(-\frac{1}{2}(\vec{x}-\mu_j)^T \Sigma_j^{-1} (\vec{x}-\mu_j)\right)$$
The maximum-likelihood mixture has the form:
$$P(\vec{x}_i) = \sum_{j=1}^{k} \pi_j \, n(\vec{x}_i; \mu_j, \Sigma_j)$$
21
Procedure of EM
Expectation Step (E): compute $h_{ij}$, the expectation of $z_{ij}$:
$$h_{ij} = E(z_{ij} \mid \vec{x}_i; \Theta) = \frac{P(\vec{x}_i \mid n_j; \Theta)}{\sum_{l=1}^{k} P(\vec{x}_i \mid n_l; \Theta)}$$
Maximization Step (M): re-estimate the parameters from the $h_{ij}$:
$$\mu_j = \frac{\sum_{i=1}^{n} h_{ij}\, \vec{x}_i}{\sum_{i=1}^{n} h_{ij}}, \qquad
\Sigma_j = \frac{\sum_{i=1}^{n} h_{ij}\, (\vec{x}_i-\mu_j)(\vec{x}_i-\mu_j)^T}{\sum_{i=1}^{n} h_{ij}}, \qquad
\pi_j = \frac{1}{n}\sum_{i=1}^{n} h_{ij}$$