Data Mining Techniques and Applications, 1st edition, Hongbo Du

ISBN 978-1-84480-891-5 © 2010 Cengage Learning

Chapter Four

Basic techniques for cluster detection

Chapter Overview

• The problem of cluster detection
• Measuring proximity between data objects
• The K-means cluster detection method
• The agglomeration cluster detection method
• Performance issues of the basic methods
• Cluster evaluation and interpretation
• Undertaking a clustering task in Weka

Problem of Cluster Detection

• What is cluster detection?
  – Cluster: a group of objects known as members
  – The centre of a cluster is known as the centroid
  – Members of a cluster are similar to each other
  – Members of different clusters are different
  – Clustering is a process of discovering clusters

[Figure: clusters of data objects; legend: centroids]

Problem of Cluster Detection

• Outputs of cluster detection process
  – Assigned cluster tag for members of a cluster
  – Cluster summary: size, centroid, variations, etc.

SubjectID   Body Height   Body Weight   Cluster Tag
s1          125           61            2
s2          178           90            1
s3          178           92            1
s4          180           83            1
s5          167           85            1
s6          170           89            1
s7          173           98            1
s8          135           40            2
s9          120           35            2
s10         145           70            2
s11         125           50            2

[Figure: scatter plot of Body Weight (30 to 100) against Body Height (100 to 190) showing the two clusters]

Cluster 1: Size: 6, Centroid: (174, 90), Variation: bodyHeight = 5.16, bodyWeight = 5.32

Cluster 2: Size: 5, Centroid: (130, 51), Variation: bodyHeight = 10, bodyWeight = 14.48

Problem of Cluster Detection

• Basic elements of a clustering solution
  – A sensible measure for similarity, e.g. Euclidean
  – An effective and efficient clustering algorithm, e.g. K-means
  – A goodness-of-fit function for evaluating the quality of resulting clusters, e.g. SSE

[Figure: example clusters annotated with internal variation and inter-cluster distance, asking "Good or Bad?"]

Problem of Cluster Detection

• Requirements for clustering solutions
  – Scalability
  – Able to deal with different types of attributes
  – Able to discover clusters of arbitrary shapes
  – Minimal requirements for domain knowledge to determine input parameters
  – Able to deal with noise and outliers
  – Insensitive to order of input data records
  – Able to deal with high dimensionality
  – Incorporation of user-specified constraints
  – Interpretability and usability

Measures of Proximity

• Basics
  – Proximity between two data objects is represented by either similarity or dissimilarity
  – Similarity: a numeric measure of the degree of alikeness; dissimilarity: a numeric measure of the degree of difference between two objects
  – Similarity measure and dissimilarity measure are often convertible; normally dissimilarity is preferred
  – Measure of dissimilarity:
    • Measuring the difference between values of the corresponding attributes
    • Combining the measures of the differences

Measures of Proximity

• Distance function
  – Metric properties of function d:
    • d(x, y) ≥ 0 and d(x, x) = 0, for all data objects x and y
    • d(x, y) = d(y, x), for all data objects x and y
    • d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
  – Difference of values for a single attribute is directly related to the domain type of the attribute.
  – It is important to consider which operations are applicable.
  – Some measure is better than no measure at all.

Measures of Proximity

• Difference between Attribute Values
  – Difference between nominal values
    • If two names are the same, the difference is 0; otherwise it is the maximum, e.g. diff("John", "John") = 0, while diff("John", "Mary") is the maximum
    • The same applies to the difference between binary values, e.g. diff(Yes, No) is the maximum
  – Difference between ordinal values
    • Different degrees of proximity can be compared, e.g. diff(A, B) < diff(A, D)
    • Converting ordinal values to consecutive integers, e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so that A – B = 1 and A – D = 3
  – Distance measure for interval and ratio attributes
  – Difference between values that may be unknown:
    diff(NULL, v) = |v|, diff(NULL, NULL) =

Measures of Proximity

• Distance between data objects
  – Ratio of mismatched features for nominal attributes
    Given two data objects i and j of p nominal attributes, let m represent the number of attributes where the values of the two objects match.

$$d(i, j) = \frac{p - m}{p}$$

e.g.

$$d(\text{row1}, \text{row2}) = \frac{6 - 4}{6} = \frac{1}{3}, \qquad d(\text{row1}, \text{row3}) = \frac{6 - 1}{6} = \frac{5}{6}$$

Body Weight   Body Height   Blood Pressure   Blood Sugar   Habit       Class
heavy         short         high             3             smoker      P
heavy         short         high             1             nonsmoker   P
normal        tall          normal           3             nonsmoker   N
heavy         tall          normal           2             smoker      N
low           medium        normal           2             nonsmoker   N
low           tall          normal           1             nonsmoker   P
normal        medium        high             3             smoker      P
low           short         high             2             smoker      P
heavy         tall          high             2             nonsmoker   P
low           medium        normal           3             smoker      P
heavy         medium        normal           3             nonsmoker   N
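The ratio of mismatched features can be checked with a few lines of Python. This is a minimal sketch; the function name `mismatch_ratio` is an illustrative choice, and the three tuples are the first three rows of the table above.

```python
def mismatch_ratio(obj_i, obj_j):
    """Ratio of mismatched features: d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                      # number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# First three rows of the table (Body Weight, Body Height, Blood Pressure,
# Blood Sugar, Habit, Class).
row1 = ("heavy", "short", "high", 3, "smoker", "P")
row2 = ("heavy", "short", "high", 1, "nonsmoker", "P")
row3 = ("normal", "tall", "normal", 3, "nonsmoker", "N")

d12 = mismatch_ratio(row1, row2)  # 4 matches out of 6 attributes
d13 = mismatch_ratio(row1, row3)  # 1 match out of 6 attributes
print(d12, d13)
```

The two results correspond to the fractions 1/3 and 5/6 in the worked example.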

Measures of Proximity

• Distance between data objects
  – Minkowski function for interval/ratio attributes

$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

Special cases:

Manhattan distance (q = 1)

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Euclidean distance (q = 2)

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

Supremum/Chebyshev (q = ∞)

$$d(i, j) = \max_{t} |x_{it} - x_{jt}|$$

Measures of Proximity

• Distance between data objects
  – Minkowski function for interval/ratio attributes (example)

customerID   No of Trans   Revenue   Tenure (Months)
101          30            1000      20
102          40            400       30
103          35            300       30
104          20            1000      35
105          50            500       1
106          80            100       10
107          10            1000      2

$$d_1(\text{cust101}, \text{cust102}) = |30 - 40| + |1000 - 400| + |20 - 30| = 620$$

$$d_2(\text{cust101}, \text{cust102}) = \sqrt{(30 - 40)^2 + (1000 - 400)^2 + (20 - 30)^2} \approx 600.17$$

$$d_{\max}(\text{cust101}, \text{cust102}) = |1000 - 400| = 600$$

[Figure: 3-D plot with axes No. of Trans, Revenue and Tenure, comparing the Manhattan, Euclidean and Chebyshev distances]
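The three distances for customers 101 and 102 can be reproduced with a short Python sketch. The function names are illustrative; the Chebyshev case is written as a separate function because it is the limit of the Minkowski function as q grows, not a finite value of q.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def chebyshev(x, y):
    """Supremum (Chebyshev) distance: the largest single-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# (No of Trans, Revenue, Tenure) for customers 101 and 102 from the table.
cust101 = (30, 1000, 20)
cust102 = (40, 400, 30)

manhattan = minkowski(cust101, cust102, 1)
euclidean = minkowski(cust101, cust102, 2)
supremum = chebyshev(cust101, cust102)
print(manhattan, round(euclidean, 2), supremum)
```

In all three results the Revenue difference of 600 dominates, which is one reason attributes are usually scaled before distances are computed.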

Measures of Proximity

• Distance between data objects
  – For binary attributes
    • Given two data objects i and j of p binary attributes,
      – f00: the number of attributes where i is 0 and j is 0
      – f01: the number of attributes where i is 0 and j is 1
      – f10: the number of attributes where i is 1 and j is 0
      – f11: the number of attributes where i is 1 and j is 1
    • Simple mismatch coefficient (SMC) for symmetric values:

$$SMC(i, j) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}}$$

    • Jaccard coefficient is defined for asymmetric values:

$$JC(i, j) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}$$

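Measures of Proximity

• Distance between data objects
  – For binary attributes (example)

DocumentID  Query  Database  Programming  Interface  User  Usability  Network  Web  GUI  HTML
d1          1      1         0            0          0     0          0        0    0    0
d2          0      1         1            0          0     0          0        0    0    0
d3          0      1         0            0          0     0          0        0    0    0

$$SMC(d_1, d_2) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{1 + 1}{7 + 1 + 1 + 1} = \frac{2}{10}$$

$$JC(d_1, d_2) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} = \frac{1 + 1}{1 + 1 + 1} = \frac{2}{3}$$

SMC not that different; JC very different: two-word (out of 3) difference

$$SMC(d_1, d_3) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{1}{8 + 1 + 1} = \frac{1}{10}$$

$$JC(d_1, d_3) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} = \frac{1}{1 + 1} = \frac{1}{2}$$

SMC very similar; JC still quite different: one word (out of 2) difference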
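The document example can be recomputed with a small Python sketch. The helper names are illustrative, and returning 0.0 when the Jaccard denominator is zero (two all-zero objects) is an assumption made here, not something the slides specify.

```python
def binary_counts(i, j):
    """Count the four value combinations (f00, f01, f10, f11) of two binary vectors."""
    f00 = f01 = f10 = f11 = 0
    for a, b in zip(i, j):
        if a == 0 and b == 0:
            f00 += 1
        elif a == 0 and b == 1:
            f01 += 1
        elif a == 1 and b == 0:
            f10 += 1
        else:
            f11 += 1
    return f00, f01, f10, f11


def smc(i, j):
    """Simple mismatch coefficient for symmetric binary attributes."""
    f00, f01, f10, f11 = binary_counts(i, j)
    return (f01 + f10) / (f00 + f01 + f10 + f11)


def jc(i, j):
    """Jaccard coefficient (as a dissimilarity) for asymmetric binary attributes."""
    f00, f01, f10, f11 = binary_counts(i, j)
    denominator = f01 + f10 + f11
    # Assumption: two all-zero objects are treated as having distance 0.
    return (f01 + f10) / denominator if denominator else 0.0


d1 = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
d2 = (0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
d3 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)

print(smc(d1, d2), jc(d1, d2))
print(smc(d1, d3), jc(d1, d3))
```

Because most terms are absent from every document, the f00 count inflates the SMC denominator; the Jaccard coefficient ignores those shared absences.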

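Measures of Proximity

• Similarity between data objects
  – Cosine similarity function
    • Treating two data objects as vectors
    • Similarity is measured by the cosine of the angle θ between the two vectors
    • Similarity is 1 when θ = 0°, and 0 when θ = 90°
    • Similarity function:

$$\cos(i, j) = \frac{i \cdot j}{\|i\| \, \|j\|}, \qquad i \cdot j = \sum_{k=1}^{n} i_k j_k, \qquad \|i\| = \sqrt{\sum_{k=1}^{n} i_k^2}$$

[Figure: two vectors i and j separated by the angle θ]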

Measures of Proximity

• Similarity between data objects
  – Cosine similarity function (illustrated)

Given two data objects: x = (3, 2, 0, 5), and y = (1, 0, 0, 0)

Since,
x · y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16

||y|| = sqrt(1² + 0² + 0² + 0²) = 1

Then, the similarity between x and y: cos(x, y) = 3/(6.16 * 1) = 0.49

The dissimilarity between x and y: 1 – cos(x, y) = 0.51
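The same calculation as a minimal Python sketch. The function name is illustrative, and raising an error for a zero-length vector is an assumption, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


x = (3, 2, 0, 5)
y = (1, 0, 0, 0)

similarity = cosine_similarity(x, y)
dissimilarity = 1 - similarity
print(round(similarity, 2), round(dissimilarity, 2))
```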

Measures of Proximity

• Distance between data objects
  – Combining heterogeneous attributes
    • Based on the principle of ratio of mismatched features
    • For the kth attribute, compute the dissimilarity d_k in [0, 1]
    • Set the indicator variable δ_k as follows:
      δ_k = 0, if the kth attribute is an asymmetric binary attribute and both objects have value 0 for the attribute
      δ_k = 1, otherwise
    • Compute the overall distance between i and j as:

$$d(i, j) = \frac{\sum_{k=1}^{n} \delta_k d_k}{\sum_{k=1}^{n} \delta_k}$$
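A minimal Python sketch of the combination rule follows. The attribute kinds, the two example objects and the choice to scale interval differences by the attribute range are all assumptions made for illustration; the slides only require each per-attribute dissimilarity to lie in [0, 1].

```python
def attribute_dissimilarity(kind, a, b, value_range=None):
    """Return (indicator, dissimilarity) for one attribute, dissimilarity in [0, 1]."""
    if kind == "asymmetric_binary":
        if a == 0 and b == 0:
            return 0, 0.0  # shared absence is ignored (indicator = 0)
        return 1, (0.0 if a == b else 1.0)
    if kind == "nominal":
        return 1, (0.0 if a == b else 1.0)
    if kind == "interval":
        # Assumption: scale the absolute difference by the attribute's range.
        return 1, abs(a - b) / value_range
    raise ValueError(f"unknown attribute kind: {kind}")


def combined_distance(kinds, obj_i, obj_j, ranges):
    """Overall distance: sum of indicator * dissimilarity over sum of indicators."""
    pairs = [attribute_dissimilarity(kind, a, b, rng)
             for kind, a, b, rng in zip(kinds, obj_i, obj_j, ranges)]
    denominator = sum(indicator for indicator, _ in pairs)
    if denominator == 0:
        raise ValueError("no attribute contributes to the distance")
    return sum(indicator * d for indicator, d in pairs) / denominator


# Hypothetical objects: (habit, height in cm, symptom A present, symptom B present).
kinds = ("nominal", "interval", "asymmetric_binary", "asymmetric_binary")
ranges = (None, 100, None, None)
person_a = ("smoker", 170, 1, 0)
person_b = ("nonsmoker", 180, 0, 0)

distance = combined_distance(kinds, person_a, person_b, ranges)
print(round(distance, 3))  # (1.0 + 0.1 + 1.0) / 3
```

The last attribute is 0 for both objects, so it drops out of both the numerator and the denominator.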

Measures of Proximity

• Distance between data objects
  – Attribute scaling
    • When:
      – on the same attribute when data from different data sources are merged
      – on different attributes when data is projected into the N-space
    • Normalising variables into comparable ranges:
      – divide each value by the mean
      – divide each value by the range
      – z-score
  – Attribute weighting
    • The weighted overall dissimilarity function:

$$d(i, j) = \frac{\sum_{k=1}^{n} w_k \delta_k d_k}{\sum_{k=1}^{n} \delta_k}$$
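The three normalisation options and the weighted function can be sketched in Python as below. The function names are illustrative, the Revenue values are taken from the customer table used in the Minkowski example, and using the sample standard deviation for the z-score is an assumption. The weights, indicators and dissimilarities passed to `weighted_distance` are made-up numbers.

```python
import statistics


def divide_by_mean(values):
    mean = statistics.mean(values)
    return [v / mean for v in values]


def divide_by_range(values):
    value_range = max(values) - min(values)
    return [v / value_range for v in values]


def z_score(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


def weighted_distance(weights, indicators, dissims):
    """Weighted overall dissimilarity: sum(w * delta * d) / sum(delta)."""
    numerator = sum(w * dl * d for w, dl, d in zip(weights, indicators, dissims))
    return numerator / sum(indicators)


revenue = [1000, 400, 300, 1000, 500, 100, 1000]
by_range = divide_by_range(revenue)
z = z_score(revenue)

# Hypothetical per-attribute values: weights, indicators, dissimilarities.
d = weighted_distance((2, 1, 1), (1, 1, 0), (1.0, 0.5, 0.0))
print([round(v, 2) for v in by_range], [round(v, 2) for v in z], d)
```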

K-means, a Basic Clustering Method

• Outline of main steps

1. Define the number of clusters (k)

2. Choose k data objects randomly to serve as the initial centroids for the k clusters

3. Assign each data object to the cluster represented by its nearest centroid

4. Find a new centroid for each cluster by calculating the mean vector of its members

5. Undo the memberships of all data objects. Go back to Step 3 and repeat the process until cluster membership no longer changes or a maximum number of iterations is reached.
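The five steps can be sketched in plain Python as follows. This is an illustrative sketch, not the book's implementation: the `initial_centroids` parameter is an addition that makes the run repeatable, Euclidean distance is assumed, and a cluster that loses all its members simply keeps its previous centroid.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(data, k, initial_centroids=None, max_iter=100, seed=0):
    # Steps 1 and 2: k is given; pick k data objects as the initial centroids.
    if initial_centroids is None:
        centroids = random.Random(seed).sample(data, k)
    else:
        centroids = list(initial_centroids)
    tags = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster of its nearest centroid.
        new_tags = [min(range(k), key=lambda c: euclidean(x, centroids[c]))
                    for x in data]
        # Step 5: stop when cluster membership no longer changes.
        if new_tags == tags:
            break
        tags = new_tags
        # Step 4: recompute each centroid as the mean vector of its members.
        for c in range(k):
            members = [x for x, t in zip(data, tags) if t == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
    return tags, centroids


# (Body Height, Body Weight) for subjects s1 to s11.
subjects = [(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
            (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)]

# Start from s2 and s1 so that the run is deterministic.
tags, centroids = k_means(subjects, 2, initial_centroids=[subjects[1], subjects[0]])
print(tags)
print(centroids)
```

With these starting centroids the memberships settle after the first reassignment, giving one cluster of six taller subjects and one of five shorter subjects.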

K-means, a Basic Clustering Method

• Illustration of the method:

K-means, a Basic Clustering Method

• Strengths & weaknesses
  – Strengths
    • Simple and easy to implement
    • Quite efficient
  – Weaknesses
    • Need to specify the value of k, but we may not know what the value should be beforehand
    • Sensitive to the choice of initial k centroids: the result can be non-deterministic
    • Sensitive to noise
    • Applicable only when mean is meaningful to the given data set

K-means, a Basic Clustering Method

• Overcoming the weaknesses:
  – Using cluster quality to determine the value of k
  – Improving how the initial k centroids are chosen
    • Running the clustering a number of times and selecting the result with the highest quality
    • Using hierarchical clustering to locate the centres
    • Finding centres that are farther apart
  – Dealing with noise
    • Removing outliers before clustering?
    • K-medoid method, using the nearest data object to the virtual centre as the centroid.
  – When mean cannot be defined,
    • K-mode method, calculating mode instead of mean for the centre of the cluster.

K-means, a Basic Clustering Method

• Value of k and cluster quality

[Figure: scree plot of cluster errors (e.g. SSE) against the number of clusters]

K-means, a Basic Clustering Method

• Choosing initial k centroids

– Running the clustering many times (only trial and error)

– Using hierarchical clustering to locate the centres (why partition based?)

– Finding centres that are farther apart

K-means, a Basic Clustering Method

• K-medoid:

• Bisecting K-means

The Agglomeration Method

• Outline of main steps

1. Take all n data objects as individual clusters and build an n x n dissimilarity matrix. The matrix stores the distance between any pair of data objects.

2. While the number of clusters > 1 do:
   i. Find a pair of data objects/clusters with the minimum distance
   ii. Merge the two data objects/clusters into a bigger cluster
   iii. Replace the entries in the matrix for the original clusters or objects by the cluster tag of the newly formed cluster
   iv. Re-calculate relevant distances and update the matrix
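The loop above can be sketched in plain Python. This is a simplified illustration under stated assumptions: Euclidean distance is used, and instead of updating matrix entries in place (steps iii and iv) the sketch recomputes each cluster-to-cluster distance from the object-level matrix on every pass, which is easier to read but slower. Passing `min` as `linkage` gives single link and `max` gives complete link; the five sample points are made up.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerate(data, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Clusters are tuples of object indices into `data`.
    """
    n = len(data)
    # Step 1: each object is its own cluster; build the n x n distance matrix.
    dist = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
    clusters = [(i,) for i in range(n)]
    merges = []
    # Step 2: merge the closest pair until one cluster remains.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 10)]
merges = agglomerate(points)  # single link
for cluster_a, cluster_b, d in merges:
    print(cluster_a, cluster_b, round(d, 2))
```

The merge history is the information a dendrogram displays: reading it from the last merge backwards gives the clusterings for 1, 2, 3, ... clusters.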

The Agglomeration Method

• Illustration of the method

The Agglomeration Method

• Illustration of the method (dendrogram)

[Figure: dendrogram, with the number of clusters falling from 10 to 1 as merges proceed]

The Agglomeration Method

• Agglomeration schemes

– Single link: the distance between two closest points

– Complete link: the distance between two farthest points

– Group average: the average of all pair-wise distances

– Centroids: the distance between the centroids
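The four schemes differ only in how the distance between two clusters is derived from distances between points. A minimal Python sketch, assuming Euclidean distance and using two small made-up clusters:

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def single_link(cluster_a, cluster_b):
    """Distance between the two closest points."""
    return min(euclidean(a, b) for a in cluster_a for b in cluster_b)


def complete_link(cluster_a, cluster_b):
    """Distance between the two farthest points."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)


def group_average(cluster_a, cluster_b):
    """Average of all pair-wise distances."""
    total = sum(euclidean(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def centroid_distance(cluster_a, cluster_b):
    """Distance between the two centroids."""
    return euclidean(mean_vector(cluster_a), mean_vector(cluster_b))


# Hypothetical clusters for illustration.
left = [(0, 0), (0, 2)]
right = [(3, 0), (5, 2)]

for scheme in (single_link, complete_link, group_average, centroid_distance):
    print(scheme.__name__, round(scheme(left, right), 3))
```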

The Agglomeration Method

• Strengths and weaknesses
  – Strengths
    • Deterministic results
    • Multiple possible versions of clustering
    • No need to specify the value of k beforehand
    • Can create clusters of arbitrary shapes (single-link)
  – Weaknesses
    • Does not scale up for large data sets
    • Cannot undo membership like the K-means
    • Problems with agglomeration schemes (see Chapter 5)

Cluster Evaluation & Interpretation

• Cluster quality
  – Principle:
    • High-level similarity/low-level variation within a cluster
    • High-level dissimilarity between clusters
  – The measures
    • Cohesion: sum of squared errors (SSE) of a cluster, and the sum of SSEs for all clusters (WC)
    • Separation: sum of squared distances between cluster centroids (BC)
    • Combining the cohesion and separation, the ratio BC/WC is a good indicator of overall quality.

$$SSE(C_k) = \sum_{x \in C_k} d(x, r_k)^2$$

$$WC = \sum_{k=1}^{K} SSE(C_k)$$

$$BC = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$$

$$Q = \frac{BC}{WC}$$

C_k: cluster k; r_k: centroid of C_k

Cluster Evaluation & Interpretation

• Cluster quality illustrated

Data: the eleven subjects s1 to s11 (Body Height, Body Weight, Cluster Tag) listed under "Outputs of cluster detection process".

[Figure: scatter plot of Body Weight against Body Height showing Cluster c1 and Cluster c2]

$$SSE(C_1) = 274.83, \qquad SSE(C_2) = 1238.8$$

C1 is a better quality cluster than C2.

$$WC = 274.83 + 1238.8 = 1513.63$$

$$BC = 3432.33$$

$$Q = \frac{3432.33}{1513.63} = 2.268$$
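The figures above can be recomputed from the eleven subjects with a short Python sketch. The helper names are illustrative; the distance is assumed to be Euclidean, so the squared distance is a plain sum of squared differences.

```python
def mean_vector(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sse(points, centroid):
    """Cohesion of one cluster: sum of squared distances to its centroid."""
    return sum(squared_distance(x, centroid) for x in points)


# (Body Height, Body Weight) grouped by cluster tag.
c1 = [(178, 90), (178, 92), (180, 83), (167, 85), (170, 89), (173, 98)]
c2 = [(125, 61), (135, 40), (120, 35), (145, 70), (125, 50)]
clusters = [c1, c2]

centroids = [mean_vector(c) for c in clusters]
sses = [sse(c, r) for c, r in zip(clusters, centroids)]

wc = sum(sses)  # within-cluster variation
bc = sum(squared_distance(centroids[j], centroids[k])  # between-cluster separation
         for j in range(len(centroids))
         for k in range(j + 1, len(centroids)))
q = bc / wc

print([round(s, 2) for s in sses], round(wc, 2), round(bc, 2), round(q, 3))
```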

Cluster Evaluation & Interpretation

• Using cluster quality for clustering
  – With K-means:
    • Add an outer loop for different values of K (from low to high)
    • At each iteration, conduct K-means clustering using the current K
    • Measure the overall cluster quality and decide whether the resulting cluster quality is acceptable
    • If not, increase the value of K by 1 and repeat the process
  – With agglomeration:
    • Traverse the hierarchy level by level from the root
    • At each level, evaluate the overall quality of clusters
    • If the quality is acceptable, take the clusters at the level as the final result. If not, move to the next level and repeat the process.

Cluster Evaluation & Interpretation

• Cluster tendency
  – Cluster tendency: do clusters really exist?
  – Measures for tendency:
    • Quality measure: when BC and WC are similar, it means clusters do not exist.
    • Use Hopkins statistic

P: a set of n randomly generated data points
S: a sample of n data points from the data set
t_p: the nearest neighbour of point p in S
t_m: the nearest neighbour of point m in P

$$H(P, S) = \frac{\sum_{p,\, t_p \in S} d(p, t_p)}{\sum_{p,\, t_p \in S} d(p, t_p) + \sum_{m,\, t_m \in P} d(m, t_m)}$$

Cluster Evaluation & Interpretation

• Cluster interpretation
  – Within cluster
    • How values of the clustering attributes are distributed
    • How values of supplementary attributes are distributed
  – Outside cluster
    • Exceptions and anomalies
  – Between cluster
    • Comparative view

[Figure: value distributions for the cluster compared with value distributions for the population]

K-means & Agglomeration in Weka

• Clustering in Weka: Preprocess page

Specify all attributes for clustering

Specify “No Class”

K-means & Agglomeration in Weka

• Clustering in Weka: Cluster page

1. Choose a Clustering Solution

2. Set parameters

3. Execute the chosen solution

4. Observe results

5. Select “Visualise Cluster Assignment”

K-means & Agglomeration in Weka

• Clustering in Weka: SimpleKMeans

Specify the value of K

Specify the distance function used

Specify the max. number of iterations

Specify the random seed affecting the initial random selection of K centroids

K-means & Agglomeration in Weka

• Clustering in Weka: SimpleKMeans

Visualise cluster membership

Save membership into a file

K-means & Agglomeration in Weka

• Clustering in Weka: Agglomeration

Select Cobweb

Tree-shaped Dendrogram

Chapter Summary

• A clustering solution must provide a sensible proximity function, effective algorithm and a cluster evaluation function

• Proximity is normally measured by a distance function that combines measures of value differences upon attributes

• The K-Means method continues to refine prototype partitions until membership changes no longer occur

• The agglomeration method constructs all possible groupings of individual data objects into a hierarchy of clusters

• Good clustering results mean high similarity among members of a cluster and low similarity between members of different clusters

• The normal procedure for clustering in Weka was explained

References

Read Chapter 4 of Data Mining Techniques and Applications

Useful further references

• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 (section 2.4) and 8.