data mining summer schoolfit.uet.vnu.edu.vn/dmss2016/files/lecture/nguyentrithanh... · 2016. 8....

Data Mining Summer School
CLUSTER ANALYSIS
Assoc. Prof. NGUYEN Tri Thanh
University of Engineering and Technology
August 11, 2016



August 11, 2016 Data Mining: Concepts and Techniques 2

Outline

1. Introduction
2. Clustering applications
3. A Categorization of Major Clustering Methods
4. Data representation
5. Flat clustering
6. Hierarchical clustering
7. Cluster evaluation
8. Summary
9. Discussion


Introduction

- Cluster analysis (clustering)
  - Given a set of objects, group similar data according to their characteristics into clusters
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- How to identify the similarity?
- How to identify the number of clusters?


Introduction

- Clustering comes naturally to humans
  - Group animals, plants
  - Group students
  - Group customers
  - Facebook groups
  - Interest groups
  - …


Clustering vs Classification

- Clustering
  - No predefined classes (unknown number of clusters)
  - Unlabeled data objects
  - Unsupervised learning
- Classification
  - Predefined classes
  - Labeled data objects
  - Predict/identify the class of an unlabeled object
  - Supervised learning


Clustering applications

- Image Processing and Pattern Recognition
- Spatial Data Analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or support other spatial mining tasks
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns


Clustering Applications

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Categorization of Clustering Approaches

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE


Major Clustering Approaches (II)

- Model-based:
  - A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: pCluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus


Flat clustering


Data object representation

- A set/vector of features/attributes
  - Person: name, age, sex, job, …
  - Text: set of distinct words
- Similarity
  - The distance: the smaller the more similar
  - The similarity: the bigger the more similar

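Cosine similarity and Euclidean distance between two vectors $d_1$ and $d_2$ with $n$ components ($d_1^i$ is the $i$-th component of $d_1$):

$$\operatorname{sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_1^i d_2^i}{\sqrt{\sum_{i=1}^{n} (d_1^i)^2}\,\sqrt{\sum_{i=1}^{n} (d_2^i)^2}}$$

$$\operatorname{dis}(d_1, d_2) = \sqrt{\sum_{i=1}^{n} (d_1^i - d_2^i)^2}$$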
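The two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the sample vectors are assumptions, and treating an all-zero vector as having similarity 0 is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)

def euclidean_distance(d1, d2):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# Hypothetical vectors: doc_b points in the same direction as doc_a,
# doc_c is orthogonal to both.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]
doc_c = [0.0, 0.0, 3.0]

print(cosine_similarity(doc_a, doc_b))   # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))   # 0.0: no shared features
print(euclidean_distance(doc_a, doc_b))  # grows with vector length, unlike cosine
```

Note that `doc_a` and `doc_b` are maximally similar under the cosine measure yet lie a nonzero Euclidean distance apart, which is one reason the cosine measure is often preferred for documents of different lengths.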


K-means Clustering

- The number of clusters (k) is known in advance
- Clusters are represented by the centroid of the documents that belong to that cluster

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

- Cluster membership is determined by the most similar cluster centroid

1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids, i.e. for each document find the most similar centroid and assign that document to the corresponding cluster.
3. For each cluster recompute the cluster centroid using the newly computed cluster members.
4. Go to Step 2 until the process converges, i.e. the same documents are assigned to each cluster in two consecutive iterations or the cluster centroids remain the same.
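The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `kmeans`, `cosine` and `centroid`, the `seed` and `max_iter` parameters, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(docs, k, sim=cosine, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)  # Step 1: pick k documents at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every document to its most similar centroid
        new_assignment = [
            max(range(k), key=lambda c: sim(centroids[c], d)) for d in docs
        ]
        # Step 4: stop when the assignment repeats in two consecutive iterations
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid from its current members
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    return assignment, centroids

# Hypothetical two-feature "documents" forming two obvious groups
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = kmeans(docs, k=2)
print(assignment)
```

Which cluster receives index 0 depends on the random initial choice, so only the grouping itself is meaningful, not the labels.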


The K-Means Clustering Method

- Example (K = 2)

[Figure: scatter plots on 0 to 10 axes showing successive k-means iterations. Panel captions: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]


K-means Clustering Discussion

- In step 2 documents are moved between clusters in order to maximize the intra-cluster similarity
- The clustering maximizes the criterion function (a measure for evaluating clustering quality)
- In distance-based k-means clustering the criterion function is the sum of squared errors (based on Euclidean distance and means), which is minimized rather than maximized
- For k-means clustering of documents a function based on centroids and similarity is used, where $D_i$ is the $i$-th cluster and $c_i$ its centroid

$$J = \sum_{i=1}^{k} \sum_{d_j \in D_i} \operatorname{sim}(c_i, d_j)$$


K-means Clustering Discussion (cont’d)

- Clustering that maximizes this function is called minimum variance clustering
- The k-means algorithm produces a minimum variance clustering, but it is not guaranteed to find the global maximum of the criterion function
- After each iteration the value of J increases (or stays the same), but it may converge to a local maximum
- The result depends strongly on the initial choice of cluster centroids


Sample data


K-means Clustering Example (result)

Clustering of CCSU Departments data with 6 TF-IDF attributes (k = 2)


Hierarchical clustering


Hierarchical Partitioning

- Produces a nested sequence of partitions of the set of data points
- Can be displayed as a tree (called a dendrogram) with a single cluster including all points at the root and singleton clusters (individual points) at the leaves
- Example of hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}

The similarity measure used in this example is computed as (10 - d)/10, where d is the distance between data points or cluster centers


Approaches to Hierarchical Partitioning

- Agglomerative
  - Starts with the data points and at each step merges the two closest (most similar) points (or clusters at later steps) until a single cluster remains
- Divisive
  - Starts with the original set of points and at each step splits a cluster until only individual points remain
  - To implement this approach we need to decide which cluster to split and how to perform the split


Approaches to Hierarchical Partitioning (cont’d)

- The agglomerative approach is more popular as it needs only the definition of a distance or similarity function on clusters/points


Approaches to Hierarchical Partitioning (cont’d)

- For data points in Euclidean space the Euclidean distance is the natural choice

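- For documents represented as TF-IDF vectors the preferred measure is the cosine similarity, defined as follows

$$\operatorname{sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_1^i d_2^i}{\sqrt{\sum_{i=1}^{n} (d_1^i)^2}\,\sqrt{\sum_{i=1}^{n} (d_2^i)^2}}$$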


Agglomerative Hierarchical Clustering

- There are several versions of this approach, depending on how the similarity between clusters sim(S1, S2) is defined (S1 and S2 are sets of documents)
  - Similarity between cluster centroids, i.e. sim(S1, S2) = sim(c1, c2), where the centroid c of cluster S is

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

  - Maximum similarity between documents from each cluster (nearest neighbor clustering)

$$\operatorname{sim}(S_1, S_2) = \max_{d_1 \in S_1,\, d_2 \in S_2} \operatorname{sim}(d_1, d_2)$$


Agglomerative Hierarchical Clustering (cont’d)

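- Minimum similarity between documents from each cluster (farthest neighbor clustering)

$$\operatorname{sim}(S_1, S_2) = \min_{d_1 \in S_1,\, d_2 \in S_2} \operatorname{sim}(d_1, d_2)$$

- Average similarity between documents from each cluster

$$\operatorname{sim}(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{d_1 \in S_1,\, d_2 \in S_2} \operatorname{sim}(d_1, d_2)$$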
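The nearest neighbor, farthest neighbor and average definitions of cluster similarity differ only in how they aggregate the pairwise similarities. The sketch below shows this on plain numbers, using the (10 - d)/10 similarity from the earlier numeric example; the function names and the two sample clusters are assumptions made for illustration.

```python
def single_link(s1, s2, sim):
    """Nearest neighbor: maximum pairwise similarity."""
    return max(sim(d1, d2) for d1 in s1 for d2 in s2)

def complete_link(s1, s2, sim):
    """Farthest neighbor: minimum pairwise similarity."""
    return min(sim(d1, d2) for d1 in s1 for d2 in s2)

def average_link(s1, s2, sim):
    """Average of all |S1| * |S2| pairwise similarities."""
    total = sum(sim(d1, d2) for d1 in s1 for d2 in s2)
    return total / (len(s1) * len(s2))

def number_sim(a, b):
    """Similarity (10 - d) / 10 for numbers, d being the absolute difference."""
    return (10 - abs(a - b)) / 10

s1, s2 = [1, 2], [4, 5]
print(single_link(s1, s2, number_sim))    # closest pair is 2 and 4
print(complete_link(s1, s2, number_sim))  # farthest pair is 1 and 5
print(average_link(s1, s2, number_sim))
```

For any pair of clusters the three values are ordered: minimum, then average, then maximum, so the choice changes which clusters look "closest" and therefore which are merged first.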


Agglomerative Clustering Algorithm

- S is the initial set of documents and G is the clustering tree
- k and q are control parameters that stop merging
  - when a desired number of clusters (k) is reached
  - or when the similarity between the clusters to be merged becomes less than a specified threshold (q)


Agglomerative Clustering Algorithm (cont’d)

1. $G \leftarrow \{\{d\} \mid d \in S\}$ (initialize $G$ with singleton clusters, each one containing a document from $S$)
2. If $|G| \le k$, then exit (stop if the desired number of clusters is reached)
3. Find $S_i, S_j \in G$ such that $(i, j) = \arg\max_{(i,j)} \operatorname{sim}(S_i, S_j)$ (find the two closest clusters)
4. If $\operatorname{sim}(S_i, S_j) < q$, then exit (stop if the similarity of the closest clusters is less than $q$)
5. Remove $S_i$ and $S_j$ from $G$
6. $G = G \cup \{S_i, S_j\}$ (merge $S_i$ and $S_j$, and add the new cluster to the hierarchy)
7. Go to 2

For n documents both the time and the space complexity of the algorithm are $O(n^2)$
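A direct, unoptimized transcription of the seven steps is sketched below on the numbers {1, 2, 4, 5, 8, 10} from the earlier example, with clusters compared by the (10 - d)/10 similarity of their centers. The names `agglomerative`, `centroid_sim` and `merges` are assumptions. Because this sketch rescans all cluster pairs at every step, it is slower than the quadratic bound quoted above; an implementation aiming for that bound would cache pairwise similarities.

```python
from itertools import combinations

def mean(cluster):
    return sum(cluster) / len(cluster)

def centroid_sim(s1, s2):
    """Similarity (10 - d) / 10, with d the distance between cluster centers."""
    return (10 - abs(mean(s1) - mean(s2))) / 10

def agglomerative(points, sim, k=1, q=float("-inf")):
    G = [(p,) for p in points]  # step 1: singleton clusters
    merges = []                 # record of (S_i, S_j, similarity) per merge
    while len(G) > k:           # step 2: stop once k clusters remain
        # step 3: find the most similar pair of clusters
        si, sj = max(combinations(G, 2), key=lambda pair: sim(*pair))
        best = sim(si, sj)
        if best < q:            # step 4: stop if even the closest pair is below q
            break
        G.remove(si)            # step 5: remove both clusters ...
        G.remove(sj)
        G.append(si + sj)       # step 6: ... and add the merged cluster
        merges.append((si, sj, best))
    return G, merges            # step 7 is the loop itself

clusters, merges = agglomerative([1, 2, 4, 5, 8, 10], centroid_sim, k=3)
print(sorted(clusters))
for si, sj, s in merges:
    print(si, sj, round(s, 2))
```

Passing `q=0.75` instead of `k=3` stops at the same three clusters in this example, since the next available merge has similarity 0.7.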


Agglomerative Clustering Example 1


DIANA clustering algorithm

0. $G = \{d_1, d_2, \ldots, d_n\}$ (initialize with a single cluster of all documents)
1. Find a document $d$ that is the most distinct from the others; $S \leftarrow \{d\}$
2. Grow the splinter group $S$:
   - 2.1. For all $d_i \notin S$, calculate $l_i = \operatorname{avg}_{d_j \notin S}(\|d_i - d_j\|) - \operatorname{avg}_{d_j \in S}(\|d_i - d_j\|)$
   - 2.2. Find $d_h$ having max $l_h$; if $l_h > 0$, $S \leftarrow S \cup \{d_h\}$
3. Repeat until there is no $l_h > 0$. Update $G$
4. Select the cluster having the biggest diameter, go to 1. For a cluster of $m$ documents the diameter is

$$\frac{\sum_{i=1}^{m} \sum_{j=1}^{m} \|d_i - d_j\|^2}{m(m-1)}$$

5. Stop when all clusters have a single document
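A single DIANA split can be sketched as below, following the usual description of the method: seed a splinter group with the most distinct document, then keep moving over the document that is on average closer to the splinter group than to the rest. The function name `diana_split`, the helper `avg_dist`, and the assumption that all items are distinct values are illustrative choices; the full algorithm would repeat this split on the cluster with the biggest diameter.

```python
def diana_split(cluster, dist):
    """Split one cluster into (splinter group, remainder)."""
    rest = list(cluster)

    def avg_dist(d, group):
        # average distance from d to the members of group other than d itself
        others = [dist(d, o) for o in group if o != d]
        return sum(others) / len(others) if others else 0.0

    # The most distinct document has the highest average distance to the others
    seed = max(rest, key=lambda d: avg_dist(d, rest))
    splinter = [seed]
    rest.remove(seed)

    while len(rest) > 1:
        # l_i: average distance to the remainder minus average distance to the splinter
        gains = {d: avg_dist(d, rest) - avg_dist(d, splinter) for d in rest}
        d_h = max(rest, key=gains.get)
        if gains[d_h] <= 0:  # nobody is closer to the splinter group: stop
            break
        rest.remove(d_h)
        splinter.append(d_h)
    return splinter, rest

splinter, rest = diana_split([1, 2, 4, 5, 8, 10], lambda a, b: abs(a - b))
print(splinter, rest)
```

On this data the value 10 seeds the splinter group, 8 joins it, and the remaining values stay together because each of them is closer on average to the remainder.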


Clustering evaluation


Web Content Mining

- Evaluating Clustering
  - Similarity-Based Criterion Functions
  - Probabilistic Criterion Functions
  - MDL-Based Model and Feature Evaluation
  - Classes to Clusters Evaluation
  - Precision, Recall and F-measure
  - Entropy


Similarity-Based Criterion Functions (distance)

- Basic idea: the cluster center $c_i$ (centroid or mean in the case of numeric data) best represents cluster $D_i$ if it minimizes the sum of the squared lengths of the "error" vectors $x - c_i$ for all $x \in D_i$

$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x$$

- Alternative formulation based on pairwise distance between cluster members

$$J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2$$
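That the centroid-based and the pairwise forms of the squared-error criterion give the same value can be checked numerically. The sketch below does so on a small hypothetical clustering of 2-D points; the function names and the data are assumptions, and the pairwise sum runs over all ordered pairs, which is what the factor 1/2 compensates for.

```python
def sse_centroid(clusters):
    """Sum over clusters of squared distances from each point to the centroid."""
    total = 0.0
    for D in clusters:
        c = [sum(col) / len(D) for col in zip(*D)]
        total += sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in D)
    return total

def sse_pairwise(clusters):
    """Same criterion from pairwise squared distances within each cluster."""
    total = 0.0
    for D in clusters:
        pair_sum = sum(
            sum((a - b) ** 2 for a, b in zip(xj, xl)) for xj in D for xl in D
        )
        total += 0.5 * pair_sum / len(D)
    return total

# Hypothetical clustering: one cluster of three points, one of two
clusters = [
    [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    [(8.0, 8.0), (9.0, 9.0)],
]
print(sse_centroid(clusters))
print(sse_pairwise(clusters))
```

Both calls print the same number up to floating-point rounding (7/3 for this data), so the pairwise form can be used when centroids are not available or not meaningful.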


Similarity-Based Criterion Functions (cosine similarity)

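- For document clustering the (centroid) cosine similarity is used

$$J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} \operatorname{sim}(c_i, d_j), \qquad \operatorname{sim}(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\|\,\|d_j\|}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j$$

- Equivalent form based on pairwise similarity

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \operatorname{sim}(d_j, d_l)$$

- Another formulation based on intracluster similarity (used to control the merging of clusters in hierarchical agglomerative clustering), where $\operatorname{sim}(D_i)$ is the average pairwise similarity within cluster $D_i$

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \operatorname{sim}(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \operatorname{sim}(D_i)$$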


Classes to Clusters Evaluation

- Assume that the classification of the documents in a sample is known, i.e. each document has a class label
- Cluster the sample without using the class labels
- Assign to each cluster the class label of the majority of documents in it
- Compute the error as the proportion of documents with different class and cluster labels
- Or compute the accuracy as the proportion of documents with the same class and cluster label
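The procedure can be sketched as follows: group the known class labels by cluster, let each cluster take its majority label, and count the documents that agree with it. The function name and the eight-document sample are assumptions made for illustration.

```python
from collections import Counter

def classes_to_clusters_accuracy(class_labels, cluster_ids):
    """Accuracy after labelling each cluster with its majority class."""
    by_cluster = {}
    for label, cid in zip(class_labels, cluster_ids):
        by_cluster.setdefault(cid, []).append(label)
    # Documents carrying the majority label of their cluster count as correct
    correct = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return correct / len(class_labels)

# Hypothetical sample: known classes and the cluster each document fell into
classes = ["A", "A", "A", "B", "B", "B", "B", "A"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy = classes_to_clusters_accuracy(classes, clusters)
print(accuracy)      # 0.75: six of eight documents match their cluster's majority
print(1 - accuracy)  # 0.25: the corresponding error
```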


Classes to Clusters Evaluation (Example)


Confusion matrix (contingency table)

TP (True Positive), FN (False Negative), FP (False Positive), TN (True Negative)

$$\text{Error} = \frac{FP + FN}{TP + FP + TN + FN}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$


Precision and Recall

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
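Error, accuracy, precision and recall all follow directly from the four cells of the confusion matrix. The sketch below computes them together; the function name and the counts are hypothetical, and the function assumes the denominators are nonzero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Error, accuracy, precision and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that were found
    }

# Hypothetical counts for 100 documents
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts precision (0.8) is higher than recall (about 0.667): most documents placed in the cluster belong there, but a third of the class ended up elsewhere.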


F-Measure

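Generalized confusion matrix for m classes and k clusters, where $n_{ij}$ is the number of documents of class $i$ in cluster $j$:

$$P(i, j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}, \qquad R(i, j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}$$

Combining precision and recall:

$$F(i, j) = \frac{2\,P(i, j)\,R(i, j)}{P(i, j) + R(i, j)}$$

Evaluating the whole clustering:

$$F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j = 1, \ldots, k} F(i, j), \qquad n_i = \sum_{j=1}^{k} n_{ij}, \qquad n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij}$$

Here $n$ is the total number of documents.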
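The overall F-measure can be computed from the generalized confusion matrix as sketched below. The function name and the 2 x 2 table are hypothetical; the code assumes every class and every cluster contains at least one document.

```python
def clustering_f_measure(n):
    """Overall F-measure; n[i][j] = documents of class i in cluster j."""
    m, k = len(n), len(n[0])
    total = sum(sum(row) for row in n)
    cluster_sizes = [sum(n[i][j] for i in range(m)) for j in range(k)]
    f_total = 0.0
    for i in range(m):
        class_size = sum(n[i])
        best = 0.0
        for j in range(k):
            if n[i][j] == 0:
                continue  # P = R = 0 for this pair, so F(i, j) = 0
            p = n[i][j] / cluster_sizes[j]  # precision of cluster j for class i
            r = n[i][j] / class_size        # recall of cluster j for class i
            best = max(best, 2 * p * r / (p + r))
        # Each class contributes its best-matching cluster, weighted by class size
        f_total += class_size / total * best
    return f_total

# Hypothetical table: rows are classes, columns are clusters
table = [
    [8, 3],
    [3, 6],
]
print(round(clustering_f_measure(table), 4))
```

For this table the first class is best matched by the first cluster (F = 8/11) and the second class by the second cluster (F = 6/9), giving an overall value of 0.7.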


F-Measure (Example)


Entropy

- Consider the class label as a random event and evaluate its probability distribution in each cluster
- The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j

$$p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}$$

- The entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label

$$H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}$$


Entropy (cont’d)

- To evaluate the whole clustering we sum up the entropies of the individual clusters, each weighted by the proportion of documents in it ($n_j$ documents in cluster $j$, $n$ documents in total)

$$H = \sum_{j=1}^{k} \frac{n_j}{n} H_j$$


Entropy (Examples)

- A "pure" cluster where all documents have a single class label has an entropy of 0
- The highest entropy is achieved when all class labels have the same probability
- For example, for a two-class problem the 50-50 situation has the highest entropy, $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ (using base-2 logarithms)


Entropy (Examples) (cont’d)

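- Compare the entropies of the previously discussed clusterings for attributes offers and students

$$H(\text{offers}) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176$$

$$H(\text{students}) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204$$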
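The two entropy values can be reproduced from the class counts per cluster with base-2 logarithms. In the sketch below the function name is an assumption, and the tables list the counts appearing in the formulas (rows are the two classes, columns the two clusters); which row corresponds to which class does not affect the result.

```python
import math

def clustering_entropy(n):
    """Weighted entropy of a clustering; n[i][j] = documents of class i in cluster j."""
    m, k = len(n), len(n[0])
    total = sum(sum(row) for row in n)
    h = 0.0
    for j in range(k):
        n_j = sum(n[i][j] for i in range(m))  # size of cluster j
        h_j = 0.0
        for i in range(m):
            if n[i][j] > 0:                   # 0 * log 0 is taken as 0
                p = n[i][j] / n_j
                h_j -= p * math.log2(p)
        h += n_j / total * h_j                # weight by share of documents
    return h

offers = [[8, 3], [3, 6]]     # clusters of 11 and 9 documents
students = [[10, 1], [5, 4]]  # clusters of 15 and 5 documents

print(round(clustering_entropy(offers), 6))
print(round(clustering_entropy(students), 6))
```

The students clustering has the slightly lower entropy, so by this measure its clusters are marginally purer.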


Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by statistical, distance-based or deviation-based approaches
- Many research issues in cluster analysis remain open


Reference

- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
- Z. Markov and D. T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, John Wiley & Sons, 2007
- Nguyễn Hà Nam, Nguyễn Trí Thành, Hà Quang Thụy, Giáo trình khai phá dữ liệu (Data Mining Textbook), NXB ĐHQGHN, 2014


Discussion
