Cluster Analysis


DESCRIPTION

AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.

TRANSCRIPT

1. Summer School "Achievements and Applications of Contemporary Informatics, Mathematics and Physics" (AACIMP 2011), August 8-20, 2011, Kiev, Ukraine. Cluster Analysis. Erik Kropat, University of the Bundeswehr Munich, Institute for Theoretical Computer Science, Mathematics and Operations Research, Neubiberg, Germany.

2. The Knowledge Discovery Process

3. [Diagram: the knowledge discovery pipeline. Raw data → pre-processing (standardizing, handling missing values / outliers) → preprocessed data → data mining (automated classification, outlier / anomaly detection, association rule learning) → patterns, clusters, correlations → pattern evaluation → knowledge → strategic planning.]

4. Clustering

5. Clustering is a tool for data analysis that solves classification problems. Problem: given n observations, split them into K similar groups. Question: how can we define similarity?

6. Similarity. A cluster is a set of entities which are alike, and entities from different clusters are not alike.

7. Distance. A cluster is an aggregation of points such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.

8. Density. Clusters may be described as connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

9. Min-Max Problem. Homogeneity: objects within the same cluster should be similar to each other (small distance between objects, high similarity). Separation: objects in different clusters should be dissimilar from each other (large distance between clusters).

10. Types of Clustering. Clustering divides into hierarchical clustering (agglomerative or divisive) and partitional clustering.

11. Similarity and Distance

12. Distance Measures. A metric on a set G is a function $d: G \times G \to \mathbb{R}_+$ that satisfies the following conditions:
(D1) $d(x, y) = 0 \Leftrightarrow x = y$ (identity)
(D2) $d(x, y) = d(y, x) \ge 0$ for all $x, y \in G$ (symmetry and non-negativity)
(D3) $d(x, y) \le d(x, z) + d(z, y)$ for all $x, y, z \in G$ (triangle inequality)

13. Examples. Minkowski distance: $d_r(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}$, $r \in [1, \infty)$, $x, y \in \mathbb{R}^n$. For r = 1 this is the Manhattan distance, for r = 2 the Euclidean distance.

14. Euclidean Distance. $d_2(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$, $x, y \in \mathbb{R}^n$. Example: for $x = (1, 1)$ and $y = (4, 3)$, $d_2(x, y) = \left( (1-4)^2 + (1-3)^2 \right)^{1/2} = \sqrt{13}$.

15. Manhattan Distance. $d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$, $x, y \in \mathbb{R}^n$. Example: for $x = (1, 1)$ and $y = (4, 3)$, $d_1(x, y) = |1-4| + |1-3| = 3 + 2 = 5$.

16. Maximum Distance. $d_\infty(x, y) = \max_{1 \le i \le n} |x_i - y_i|$, $x, y \in \mathbb{R}^n$. Example: for $x = (1, 1)$ and $y = (4, 3)$, $d_\infty(x, y) = \max(3, 2) = 3$. (A code sketch of these three distances follows slide 22 below.)

17. Similarity Measures. A similarity function on a set G is a function $S: G \times G \to \mathbb{R}$ that satisfies the following conditions:
(S1) $S(x, y) \ge 0$ for all $x, y \in G$ (non-negativity)
(S2) $S(x, y) \le S(x, x)$ for all $x, y \in G$ (auto-similarity)
(S3) $S(x, y) = S(x, x) \Leftrightarrow x = y$ for all $x, y \in G$ (identity)
The value of the similarity function is greater when two points are closer.

18. Similarity Measures. There are many different definitions of similarity. Often used:
(S4) $S(x, y) = S(y, x)$ for all $x, y \in G$ (symmetry)

19. Hierarchical Clustering

20. Dendrogram. [Figure: cluster dendrogram, Euclidean distance (complete linkage); gross national product of EU countries, agriculture (1993). Source: www.isa.uni-stuttgart.de/lehre/SAHBD]

21. Hierarchical Clustering. Hierarchical clustering creates a hierarchy of clusters of the set G. Agglomerative clustering: clusters are successively merged together. Divisive clustering: clusters are recursively split.

22. Agglomerative Clustering. In each step, merge the two clusters with the smallest distance between them:
Step 0: {e1}, {e2}, {e3}, {e4} (4 clusters)
Step 1: {e1, e2}, {e3}, {e4} (3 clusters)
Step 2: {e1, e2, e3}, {e4} (2 clusters)
Step 3: {e1, e2, e3, e4} (1 cluster)
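To make the distance measures on slides 13-16 concrete, here is a minimal Python sketch (illustrative code, not part of the original lecture; the function names are my own). It reproduces the worked example with x = (1, 1) and y = (4, 3).

```python
import math

def minkowski(x, y, r):
    """Minkowski distance d_r(x, y) = (sum_i |x_i - y_i|^r)^(1/r), r >= 1."""
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

def manhattan(x, y):
    """r = 1: Manhattan distance."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """r = 2: Euclidean distance."""
    return minkowski(x, y, 2)

def maximum(x, y):
    """Limit r -> infinity: maximum (Chebyshev) distance."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (1, 1), (4, 3)
print(manhattan(x, y))                 # 5.0 (slide 15)
print(euclidean(x, y), math.sqrt(13))  # 3.6055... = sqrt(13) (slide 14)
print(maximum(x, y))                   # 3 (slide 16)
```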
23. Divisive Clustering. Choose a cluster that optimally splits into two clusters according to a given criterion:
Step 0: {e1, e2, e3, e4} (1 cluster)
Step 1: {e1, e2}, {e3, e4} (2 clusters)
Step 2: {e1, e2}, {e3}, {e4} (3 clusters)
Step 3: {e1}, {e2}, {e3}, {e4} (4 clusters)

24. Agglomerative Clustering

25. INPUT. Given n objects G = {e1, ..., en} represented by p-dimensional feature vectors $x_1, \dots, x_n \in \mathbb{R}^p$: object $e_i$ has the feature vector $x_i = (x_{i1}, x_{i2}, x_{i3}, \dots, x_{ip})$ over the features 1, ..., p.

26. Example I. An online shop collects data from its customers. For each of the n customers there exists a p-dimensional feature vector.

27. Example II. In a clinical trial, laboratory values of a large number of patients are gathered. For each of the n patients there exists a p-dimensional feature vector.

28. Agglomerative Algorithms. Begin with the disjoint clustering C1 = { {e1}, {e2}, ..., {en} }. Terminate when all objects are in one cluster: Cn = { {e1, e2, ..., en} }. Iterate: find the most similar pair of clusters and merge them into a single cluster. This yields a sequence of clusterings $(C_i)_{i=1,\dots,n}$ of G, where each $C_{i-1}$ is a refinement of $C_i$ for $i = 2, \dots, n$.

29. What is the distance d(A, B) between two clusters A and B? Different answers to this question give rise to various hierarchical clustering algorithms.

30. Agglomerative Hierarchical Clustering. There exist many metrics to measure the distance between clusters. They lead to particular agglomerative clustering methods: single-linkage clustering, complete-linkage clustering, average-linkage clustering, the centroid method, ...

31. Single-Linkage Clustering (nearest-neighbor method). The distance between the clusters A and B is the minimum distance between the elements of each cluster: $d(A, B) = \min \{ d(a, b) \mid a \in A, b \in B \}$.

32. Single-Linkage Clustering. Advantages: it can detect very long and even curved clusters, and it can be used to detect outliers. Drawback: the chaining phenomenon. Clusters that are very distant from each other may be forced together due to single elements being close to each other.

33. Complete-Linkage Clustering (furthest-neighbor method). The distance between the clusters A and B is the maximum distance between the elements of each cluster: $d(A, B) = \max \{ d(a, b) \mid a \in A, b \in B \}$.

34. Complete-Linkage Clustering tends to find compact clusters of approximately equal diameter, avoids the chaining phenomenon, and cannot be used for outlier detection.

35. Average-Linkage Clustering. The distance between the clusters A and B is the mean distance between the elements of each cluster: $d(A, B) = \frac{1}{|A| \, |B|} \sum_{a \in A,\, b \in B} d(a, b)$.

36. Centroid Method. The distance between the clusters A and B is the (squared) Euclidean distance of the cluster centroids.

37. Agglomerative Hierarchical Clustering. [Figure: the cluster distances used by single linkage, complete linkage, average linkage, and the centroid method.]

38. Bioinformatics. Alizadeh et al., Nature 403 (2000), pp. 503-511.

39. Exercise. [Map: Berlin, Kiev, Paris, Odessa.]

40. Exercise. The following table shows the distances between 4 cities:

         Kiev   Odessa   Berlin   Paris
Kiev        -      440     1200    2000
Odessa    440        -     1400    2100
Berlin   1200     1400        -     900
Paris    2000     2100      900       -

Determine a hierarchical clustering with the single-linkage method. (A code sketch of the computation follows below.)
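Before the solution slides, a small from-scratch Python sketch of single-linkage agglomeration on the exercise's distance table (illustrative only; in practice one would use a library routine such as scipy.cluster.hierarchy.linkage). It prints the same merge sequence the following slides derive: 440, 900, 1200.

```python
cities = ["Kiev", "Odessa", "Berlin", "Paris"]
table = {
    ("Kiev", "Odessa"): 440, ("Kiev", "Berlin"): 1200, ("Kiev", "Paris"): 2000,
    ("Odessa", "Berlin"): 1400, ("Odessa", "Paris"): 2100,
    ("Berlin", "Paris"): 900,
}

def d(a, b):
    # Symmetric lookup in the pairwise distance table.
    return table[(a, b)] if (a, b) in table else table[(b, a)]

def single_linkage(A, B):
    # d(A, B) = min { d(a, b) | a in A, b in B }  (slide 31)
    return min(d(a, b) for a in A for b in B)

clusters = [{c} for c in cities]
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance ...
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]]),
    )
    print("merge", clusters[i], "and", clusters[j],
          "at distance", single_linkage(clusters[i], clusters[j]))
    # ... and merge them into a single cluster.
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)
```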
41. Solution - Single Linkage. Step 0: clustering {Kiev}, {Odessa}, {Berlin}, {Paris}. The distances between the clusters are those of the table above.

42. Solution - Single Linkage. Step 0: the minimal distance between clusters is d({Kiev}, {Odessa}) = 440. Merge the clusters {Kiev} and {Odessa}. Distance value: 440.

43. Solution - Single Linkage. Step 1: clustering {Kiev, Odessa}, {Berlin}, {Paris}. Distances between clusters:

               Kiev, Odessa   Berlin   Paris
Kiev, Odessa              -     1200    2000
Berlin                 1200        -     900
Paris                  2000      900       -

44. Solution - Single Linkage. Step 1: the minimal distance between clusters is d({Berlin}, {Paris}) = 900. Merge the clusters {Berlin} and {Paris}. Distance value: 900.

45. Solution - Single Linkage. Step 2: clustering {Kiev, Odessa}, {Berlin, Paris}. The only remaining distance, d({Kiev, Odessa}, {Berlin, Paris}) = 1200, is minimal. Merge the clusters {Kiev, Odessa} and {Berlin, Paris}. Distance value: 1200.

46. Solution - Single Linkage. Step 3: clustering {Kiev, Odessa, Berlin, Paris}.

47. Solution - Single Linkage. [Dendrogram: at distance value 0 there are 4 clusters (Kiev, Odessa, Berlin, Paris); Kiev and Odessa merge at 440 (3 clusters), Berlin and Paris at 900 (2 clusters), and the last two clusters at 1200 (1 cluster).]

48. Divisive Clustering

49. Divisive Algorithms. Begin with one cluster: C1 = { {e1, e2, ..., en} }. Terminate when all objects are in disjoint clusters: Cn = { {e1}, {e2}, ..., {en} }. Iterate: choose a cluster Cf that optimally splits into two clusters Ci and Cj according to a given criterion. This yields a sequence of clusterings $(C_i)_{i=1,\dots,n}$ of G, where each $C_{i+1}$ is a refinement of $C_i$ for $i = 1, \dots, n-1$.

50. Partitional Clustering: Minimal Distance Methods

51. Partitional Clustering. Aims to partition n observations into K clusters. The number of clusters K and an initial partition are given. The initial partition is considered as not optimal and is iteratively repartitioned. Note that the number of clusters is given! [Figure: an initial and a final partition for K = 2.]

52. Partitional Clustering. Differences to hierarchical clustering: the number of clusters is fixed, and an object can change its cluster. The initial partition is obtained at random or by applying a hierarchical clustering algorithm in advance. The number of clusters can be estimated by specialized methods (e.g., the silhouette) or by applying a hierarchical clustering algorithm in advance.

53. Partitional Clustering - Methods. In this course we will introduce the minimal distance methods K-Means and Fuzzy-C-Means.

54. K-Means

55. K-Means. Aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean. Find K cluster centroids $\mu_1, \dots, \mu_K$ that minimize the objective function $J = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(\mu_i, x)^2$. [Figure: observations in G grouped into clusters C1, C2, C3 around their centroids.]
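The transcript breaks off during the K-means slides. As a sketch of the iteration behind the objective J on slide 55, assuming the standard K-means loop (Lloyd's algorithm), the following minimal Python implementation alternates between assigning each observation to the nearest centroid and recomputing the centroids. The toy data, seed, and function names are my own illustrative assumptions, not from the lecture.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, iters=100, seed=0):
    # Initial partition: k randomly chosen observations serve as centroids.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Assignment step: each observation joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, centroids[i]))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # converged: J can no longer decrease
            break
        centroids = new
    return centroids, clusters

# Toy 2-D data with two visible groups; K = 2 is fixed in advance.
data = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```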