
Page 1: 2004/05/03 Clustering 1 Clustering (Part One) Ku-Yaw Chang canseco@mail.dyu.edu.tw Assistant Professor, Department of Computer Science and Information

2004/05/03 Clustering 1

Clustering (Part One)

Ku-Yaw Chang
canseco@mail.dyu.edu.tw

Assistant Professor, Department of Computer Science and Information Engineering
Da-Yeh University

Page 2: Outline

Introduction
Hierarchical Clustering
Partitional Clustering

Page 3: Introduction

Supervised learning
  Training set
Unsupervised learning
  Divide samples into naturally occurring groups or clusters based on measures of similarity, without any prior knowledge of class membership

Page 4: Introduction

Clustering
  Grouping samples so that the samples are similar within each group.
  The groups are called clusters.
In image analysis
  Used to find groups of pixels with similar gray levels, colors, or local textures
  To discover various regions in the image

Page 5: Introduction

Hierarchical Clustering
  From bottom to top
Partitional Clustering
  From top to bottom
  The number of clusters to be constructed is specified in advance.

Page 6: Hierarchical Clustering

A hierarchy can be represented by a tree structure.

[Figure: a tree rooted at Animals with nodes Dogs, Cats, Large, Small, St. Bernard, Labrador, LongHair, and ShortHair; beside it, a dendrogram over samples 1 to 5 with a vertical Level axis running from 0 to 5.]

Page 7: Hierarchical Clustering

A clustering process that organizes the data into large groups, which contain smaller groups, and so on.
May be drawn as a tree or dendrogram.
The finest grouping
  At the bottom of the dendrogram
The coarsest grouping
  At the top of the dendrogram

Page 8: Hierarchical Clustering

At level 0: {1}, {2}, {3}, {4}, {5}
At level 1: {1, 2}, {3}, {4}, {5}
At level 2: {1, 2}, {3}, {4, 5}
At level 3: {1, 2, 3}, {4, 5}
At level 4: {1, 2, 3, 4, 5}

[Figure: the Animals tree and the dendrogram over samples 1 to 5 with the Level axis, repeated from the previous slide.]

Page 9: Agglomerative Clustering Algorithm

1. Begin with n clusters, each containing one sample.
2. Repeat step 3 a total of n-1 times.
3. Find the most similar clusters C_i and C_j and merge C_i and C_j into one cluster. If there is a tie, merge the first pair found.
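The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the names `agglomerate` and `single_linkage` are made up for the example, it assumes Python 3.8+ for `math.dist`, and it uses the smallest point-to-point Euclidean distance as the default similarity measure.

```python
from itertools import combinations
from math import dist


def single_linkage(c1, c2):
    """Smallest point-to-point Euclidean distance between two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)


def agglomerate(points, cluster_distance=single_linkage):
    """Merge the closest pair of clusters n-1 times; return the merge history."""
    clusters = [[p] for p in points]            # step 1: n singleton clusters
    history = []
    for _ in range(len(points) - 1):            # step 2: repeat n-1 times
        # step 3: min() keeps the first minimal pair, so ties go to the first pair found
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((merged, d))
    return history


history = agglomerate([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)])
for merged, d in history:
    print(f"{d:5.1f}  {merged}")
```

Recomputing every cluster distance on each pass keeps the sketch short at the cost of speed; a practical implementation would cache and update a distance matrix instead.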

Page 10: Hierarchical Clustering Algorithm

Different methods to determine the similarity of clusters.
  Define a function that measures the distance between clusters
The most popular distance measures are Euclidean distance and city block distance.

Page 11: Euclidean Distance

n-dimensional feature space
  The distance between two points a = (a_1, ..., a_n) and b = (b_1, ..., b_n) is defined by

$$d_e(a, b) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2}$$

To save computing time, the square root is usually not computed: comparing squared distances gives the same ordering.
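As a small illustration of the formula and of the square-root shortcut, the sketch below defines both the Euclidean distance and its square. The function names are chosen for this example only.

```python
from math import sqrt


def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((bi - ai) ** 2 for ai, bi in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(squared_euclidean(a, b))


p, q, r = (4, 4), (8, 4), (15, 8)
print(euclidean(p, q), round(euclidean(p, r), 1))
# Comparing squared distances gives the same answer as comparing distances.
print(squared_euclidean(p, q) < squared_euclidean(p, r))
```

Because the square root is monotonic, the pair with the smallest squared distance is also the pair with the smallest distance, which is all the merge step needs.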

Page 12: City Block Distance

The sum of the absolute differences in each feature.

$$d_{cb}(a, b) = \sum_{i=1}^{n} |b_i - a_i|$$

Also called
  Manhattan metric
  Taxicab distance
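The city block formula translates directly into code. The function name `city_block` is an illustrative choice, not something the slides define.

```python
def city_block(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(bi - ai) for ai, bi in zip(a, b))


# |15 - 4| + |8 - 4| = 11 + 4
print(city_block((4, 4), (15, 8)))
```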

Page 13: Hierarchical Clustering

The Single-Linkage Algorithm

Also known as
  The minimum method
  The nearest neighbor method
The distance between two clusters
  The smallest distance between two points such that one point is in each cluster

$$D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$$
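A possible rendering of the single-linkage definition in Python follows. The point distance `d` is a parameter so that either Euclidean or city block distance can be plugged in; the default, `math.dist` (Python 3.8+), is an assumption made for the example.

```python
from math import dist


def single_linkage(ci, cj, d=dist):
    """Smallest distance d(a, b) with a in cluster ci and b in cluster cj."""
    return min(d(a, b) for a in ci for b in cj)


# Distance from the cluster {(4, 4), (8, 4)} to the single point (15, 8)
print(round(single_linkage([(4, 4), (8, 4)], [(15, 8)]), 1))
```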

Page 14: Hierarchical Clustering

The Single-Linkage Algorithm

Use Euclidean distance
{1}, {2}, {3}, {4}, {5}

Sample    X    Y
1         4    4
2         8    4
3        15    8
4        24    4
5        24   12

        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -
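The distance matrix can be recomputed from the five sample coordinates, which is a quick way to check individual entries. The sketch below assumes Python 3.8+ for `math.dist`; the dictionary layout and the name `distance_matrix` are illustrative.

```python
from math import dist

samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}


def distance_matrix(samples):
    """Return {(i, j): Euclidean distance} for every ordered pair of distinct labels."""
    return {
        (i, j): dist(p, q)
        for i, p in samples.items()
        for j, q in samples.items()
        if i != j
    }


D = distance_matrix(samples)
for i in samples:
    cells = [f"{'-':>5}" if i == j else f"{D[i, j]:5.1f}" for j in samples]
    print(i, " ".join(cells))
```

The matrix is symmetric, so the entry for row i, column j should always match the entry for row j, column i.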

Page 15: Hierarchical Clustering

The Single-Linkage Algorithm

         {1,2}      3      4      5
{1,2}        -    8.1   16.0   17.9
3          8.1      -    9.8    9.8
4         16.0    9.8      -    8.0
5         17.9    9.8    8.0      -

{1, 2}, {3}, {4}, {5}
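The row for the merged cluster {1, 2} follows from the single-linkage rule: each new entry is the smaller of the two distances that samples 1 and 2 had to the other sample. Written out with the values from the original distance matrix:

$$D_{SL}(\{1,2\},\{3\}) = \min(d(1,3),\, d(2,3)) = \min(11.7,\, 8.1) = 8.1$$

$$D_{SL}(\{1,2\},\{4\}) = \min(d(1,4),\, d(2,4)) = \min(20.0,\, 16.0) = 16.0$$

$$D_{SL}(\{1,2\},\{5\}) = \min(d(1,5),\, d(2,5)) = \min(21.5,\, 17.9) = 17.9$$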

Page 16: Hierarchical Clustering

The Single-Linkage Algorithm

         {1,2}      3  {4,5}
{1,2}        -    8.1   16.0
3          8.1      -    9.8
{4,5}     16.0    9.8      -

{1, 2}, {3}, {4, 5}

Page 17: Hierarchical Clustering

The Single-Linkage Algorithm

           {1,2,3}  {4,5}
{1,2,3}          -    9.8
{4,5}          9.8      -

{1, 2, 3}, {4, 5}
{1, 2, 3, 4, 5}

Page 18: Hierarchical Clustering

The Complete-Linkage Algorithm

Also known as
  The maximum method
  The farthest neighbor method
The distance between two clusters
  The largest distance between two points such that one point is in each cluster

$$D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$$
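The complete-linkage definition differs from single linkage only in taking the maximum. A hedged sketch, again with an illustrative function name and `math.dist` (Python 3.8+) assumed as the default point distance:

```python
from math import dist


def complete_linkage(ci, cj, d=dist):
    """Largest distance d(a, b) with a in cluster ci and b in cluster cj."""
    return max(d(a, b) for a in ci for b in cj)


# Farthest pair between {(4, 4), (8, 4)} and {(24, 4), (24, 12)}
print(round(complete_linkage([(4, 4), (8, 4)], [(24, 4), (24, 12)]), 1))
```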

Page 19: Hierarchical Clustering

The Complete-Linkage Algorithm

Use Euclidean distance
{1}, {2}, {3}, {4}, {5}

Sample    X    Y
1         4    4
2         8    4
3        15    8
4        24    4
5        24   12

        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -

Page 20: Hierarchical Clustering

The Complete-Linkage Algorithm

         {1,2}      3      4      5
{1,2}        -   11.7   20.0   21.5
3         11.7      -    9.8    9.8
4         20.0    9.8      -    8.0
5         21.5    9.8    8.0      -

{1, 2}, {3}, {4}, {5}

Page 21: Hierarchical Clustering

The Complete-Linkage Algorithm

         {1,2}      3  {4,5}
{1,2}        -   11.7   21.5
3         11.7      -    9.8
{4,5}     21.5    9.8      -

{1, 2}, {3}, {4, 5}

Page 22: Hierarchical Clustering

The Complete-Linkage Algorithm

           {1,2}  {3,4,5}
{1,2}          -     21.5
{3,4,5}     21.5        -

{1, 2}, {3, 4, 5}
{1, 2, 3, 4, 5}
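The last two complete-linkage steps can be checked by hand. When two clusters merge, the distance from the merged cluster to any other cluster is the larger of the two previous distances:

$$D_{CL}(\{3\},\{4,5\}) = \max(d(3,4),\, d(3,5)) = \max(9.8,\, 9.8) = 9.8$$

$$D_{CL}(\{1,2\},\{3,4,5\}) = \max\big(D_{CL}(\{1,2\},\{3\}),\, D_{CL}(\{1,2\},\{4,5\})\big) = \max(11.7,\, 21.5) = 21.5$$

Since 9.8 is smaller than 11.7, sample 3 joins {4, 5} rather than {1, 2}, which is where this hierarchy departs from the single-linkage one.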

Page 23: Problem

A cluster contains three samples at (0,1), (0,2), and (0,3). Another cluster contains samples at (1,7), (1,8), and (1,9).

(a) What is the single-linkage distance between the clusters if city block distance is used?
(b) What is the single-linkage distance between the clusters if Euclidean distance is used?
(c) What is the complete-linkage distance between the clusters if city block distance is used?
(d) What is the complete-linkage distance between the clusters if Euclidean distance is used?
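One way to check answers to this problem is to enumerate all nine cross-cluster point pairs. The sketch below is illustrative: the helper names `city_block` and `linkage` are not from the slides, and `math.dist` requires Python 3.8+.

```python
from math import dist


def city_block(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linkage(ci, cj, point_distance, pick):
    """Apply pick (min or max) to all cross-cluster point distances."""
    return pick(point_distance(a, b) for a in ci for b in cj)


c1 = [(0, 1), (0, 2), (0, 3)]
c2 = [(1, 7), (1, 8), (1, 9)]

answers = {
    "a": linkage(c1, c2, city_block, min),  # single linkage, city block
    "b": linkage(c1, c2, dist, min),        # single linkage, Euclidean
    "c": linkage(c1, c2, city_block, max),  # complete linkage, city block
    "d": linkage(c1, c2, dist, max),        # complete linkage, Euclidean
}
for part, value in answers.items():
    print(f"({part}) {value:.3f}")
```

The closest pair is (0,3) and (1,7) and the farthest pair is (0,1) and (1,9), so the script prints 5.000, 4.123, 9.000, and 8.062 for parts (a) through (d).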

Page 24: Hierarchical Clustering

The Average-Linkage Algorithm

Also known as UPGMA
  Unweighted pair-group method using arithmetic averages
The distance between two clusters
  The average distance between two points such that one point is in each cluster

$$D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b)$$

where n_i and n_j are the numbers of samples in C_i and C_j.
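The average-linkage definition can be sketched the same way as the other two linkages. The function name is illustrative and `math.dist` (Python 3.8+) is assumed as the default point distance.

```python
from math import dist


def average_linkage(ci, cj, d=dist):
    """Mean of d(a, b) over all pairs with a in cluster ci and b in cluster cj."""
    total = sum(d(a, b) for a in ci for b in cj)
    return total / (len(ci) * len(cj))


# Mean of the two distances from (15, 8) to (4, 4) and to (8, 4)
print(round(average_linkage([(4, 4), (8, 4)], [(15, 8)]), 1))
```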

Page 25: Hierarchical Clustering

The Average-Linkage Algorithm

Use Euclidean distance
{1}, {2}, {3}, {4}, {5}

Sample    X    Y
1         4    4
2         8    4
3        15    8
4        24    4
5        24   12

        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -

Page 26: Hierarchical Clustering

The Average-Linkage Algorithm

         {1,2}      3      4      5
{1,2}        -    9.9   18.0   19.7
3          9.9      -    9.8    9.8
4         18.0    9.8      -    8.0
5         19.7    9.8    8.0      -

{1, 2}, {3}, {4}, {5}

Page 27: Hierarchical Clustering

The Average-Linkage Algorithm

         {1,2}      3  {4,5}
{1,2}        -    9.9   18.9
3          9.9      -    9.8
{4,5}     18.9    9.8      -

{1, 2}, {3}, {4, 5}

Page 28: Hierarchical Clustering

The Average-Linkage Algorithm

           {1,2}  {3,4,5}
{1,2}          -     15.9
{3,4,5}     15.9        -

{1, 2}, {3, 4, 5}
{1, 2, 3, 4, 5}
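The average-linkage hierarchy can be reproduced without revisiting the raw points at every step. In the sketch below, the distance from a merged cluster to any other cluster is the size-weighted mean of the two old distances, which equals the mean over all point pairs in the definition of D_AL. The names `pair` and `upgma_merges` are illustrative, and `math.dist` requires Python 3.8+.

```python
from math import dist

points = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}


def pair(a, b):
    """Order-independent dictionary key for two clusters."""
    return frozenset((a, b))


def upgma_merges(points):
    """Average-linkage agglomeration; returns [(sorted labels, merge distance)]."""
    clusters = [frozenset([k]) for k in points]
    D = {pair(a, b): dist(points[min(a)], points[min(b)])
         for i, a in enumerate(clusters) for b in clusters[i + 1:]}
    merges = []
    while len(clusters) > 1:
        candidates = ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:])
        a, b = min(candidates, key=lambda ab: D[pair(*ab)])
        merges.append((sorted(a | b), D[pair(a, b)]))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # weight each old distance by the number of samples it averaged over
            D[pair(a | b, c)] = (len(a) * D[pair(a, c)] + len(b) * D[pair(b, c)]) / (len(a) + len(b))
        clusters.append(a | b)
    return merges


for labels, d in upgma_merges(points):
    print(f"{d:5.1f}  {labels}")
```

Weighting by cluster size is what makes the method "unweighted" with respect to individual samples: every point pair contributes equally to the average, no matter when its cluster was formed.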

Page 29: Problem

Compute the average-linkage distance between the two clusters { (3,4), (5,6) } and { (1,1), (2,2) }

(a) Using city block distance between points.
(b) Using Euclidean distance between points.
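A worked solution, offered as a check rather than as part of the original slides. There are 2 × 2 = 4 cross-cluster pairs: (3,4) with (1,1), (3,4) with (2,2), (5,6) with (1,1), and (5,6) with (2,2).

(a) The city block distances are 5, 3, 9, and 7, so

$$D_{AL} = \frac{5 + 3 + 9 + 7}{2 \cdot 2} = \frac{24}{4} = 6$$

(b) The Euclidean distances are $\sqrt{13}$, $\sqrt{5}$, $\sqrt{41}$, and $5$, so

$$D_{AL} = \frac{\sqrt{13} + \sqrt{5} + \sqrt{41} + 5}{4} \approx \frac{3.606 + 2.236 + 6.403 + 5}{4} \approx 4.31$$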