# Cluster Analysis

Post on 02-Jan-2016

Mark Stamp

## Cluster Analysis

- Grouping objects in a meaningful way
- Clustered data fits together in some way
- Can help to make sense of (big) data
- Useful analysis technique in many fields
- Many different clustering strategies
- Overview, then details on 2 methods:
  - K-means: simple, and can be effective
  - EM clustering: not as simple

## Intrinsic vs Extrinsic

- Intrinsic clustering relies on unsupervised learning
  - No predetermined labels on objects
  - Apply analysis directly to data
- Extrinsic relies on category labels
  - Requires pre-processing of data
  - Can be viewed as a form of supervised learning

## Agglomerative vs Divisive

- Agglomerative
  - Each object starts in its own cluster
  - Clustering merges existing clusters
  - A bottom-up approach
- Divisive
  - All objects start in one cluster
  - Clustering process splits existing clusters
  - A top-down approach

## Hierarchical vs Partitional

- Hierarchical clustering
  - Child and parent clusters
  - Can be viewed as dendrograms
- Partitional clustering
  - Partition objects into disjoint clusters
  - No hierarchical relationship
- We consider K-means and EM in detail; these are both partitional

## Hierarchical Clustering

- Example of a hierarchical approach:

        start: every point is its own cluster
        while number of clusters exceeds 1
            find 2 nearest clusters and merge
        end while

- OK, but no real theoretical basis
  - And some find that disconcerting
  - Even K-means has some theory behind it
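The hierarchical-clustering loop above can be fleshed out as a short sketch. The slide does not say how to measure the distance between two clusters, so single-linkage (distance between the closest pair of points) is assumed here; the function name and `k` stopping parameter are mine, not from the slides.

```python
import math

def hierarchical(points, k=1):
    # start: every point is its own cluster
    clusters = [[p] for p in points]
    # while number of clusters exceeds k: find 2 nearest clusters and merge
    while len(clusters) > k:
        best = (float("inf"), 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance between cluster i and cluster j
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(hierarchical(pts, k=2))  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Stopping at `k` clusters (rather than merging all the way down to 1) gives a flat clustering; recording the sequence of merges instead would give the dendrogram discussed on the next slide.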

## Dendrogram

- Example: obtained by hierarchical clustering... maybe
- (figure: dendrogram, not reproduced)

## Distance

- Distance between data points?
- Suppose x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), where each xi and yi are real numbers
- Euclidean distance is

        d(x, y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)

- Manhattan (taxicab) distance is

        d(x, y) = |x1-y1| + |x2-y2| + ... + |xn-yn|
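The two distance formulas above translate directly to Python (a minimal sketch; the function names are mine, not from the slides):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt((x1-y1)^2 + ... + (xn-yn)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = |x1-y1| + ... + |xn-yn|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```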

(Note: lots and lots of other distance measures can be considered, as mentioned in a couple of slides.)

## Distance

- Euclidean distance: the red line between points a and b
- Manhattan distance: the blue or yellow path
  - Or any similar right-angle-only path
- (figure: points a and b connected by straight-line and grid paths, not reproduced)

## Distance

- Lots and lots more distance measures
- Other examples include:
  - Mahalanobis distance: takes mean and covariance into account
  - Simple substitution distance: measure of decryption distance
  - Chi-squared distance: statistical
  - Or just about anything you can think of

## One Clustering Approach

- Given data points x1, x2, x3, ..., xm
- Want to partition into K clusters
  - I.e., each point in exactly one cluster
- A centroid is specified for each cluster
  - Let c1, c2, ..., cK denote the current centroids
- Each xi is associated with one centroid
  - Let centroid(xi) be the centroid for xi
  - If cj = centroid(xi), then xi is in cluster j

## Clustering

- Two crucial questions
  1. How to determine centroids, cj?
  2. How to determine clusters, that is, how to assign xi to centroids?
- But first, what makes a cluster good?
  - For now, focus on one individual cluster
  - Relationship between clusters later
- What do you think?

## Distortion

- Intuitively, compact clusters are good
  - Depends on data and K, which are given
  - And depends on centroids and assignment of xi to clusters, which we can control
- How to measure "goodness"?
- Define distortion = Σ d(xi, centroid(xi))
  - Where d(x, y) is a distance measure
- Given K, let's try to minimize distortion

(Note: recall that K is the number of clusters.)

## Distortion

- Consider this 2-d data, with K = 3 clusters chosen
  - Same data for both clusterings
  - (figure: two clusterings of the same 2-d data, not reproduced)
- Which has smaller distortion?
- How to minimize distortion?
  - Good question...

## Distortion

- Note, distortion depends on K
  - So, should probably write distortion_K
- Problem we want to solve:
  - Given: K and x1, x2, x3, ..., xm
  - Minimize: distortion_K
- Best choice of K is a different issue
  - Briefly considered later
  - For now, assume K is given and fixed
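The distortion definition above is easy to compute once an assignment of points to centroids is fixed. A minimal sketch (names are mine; Euclidean distance is assumed, but any measure works):

```python
import math

def distortion(points, centroids, assign):
    # distortion = sum over i of d(xi, centroid(xi)),
    # where assign[i] is the index of the centroid for point i
    # and d is Euclidean distance here
    return sum(math.dist(x, centroids[assign[i]])
               for i, x in enumerate(points))

pts = [(0.0, 0.0), (4.0, 0.0)]
cents = [(0.0, 0.0), (4.0, 0.0)]
# each point assigned to the centroid sitting on top of it
print(distortion(pts, cents, [0, 1]))  # 0.0
# swap the assignment: each point is now 4 away, so distortion is 8
print(distortion(pts, cents, [1, 0]))  # 8.0
```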

## How to Minimize Distortion?

- Given m data points and K
- Minimize distortion via exhaustive search?
  - Try all "m choose K" different cases?
  - Too much work for a realistic-size data set
- An approximate solution will have to do
  - The exact solution is an NP-complete problem
- Important observation: for minimum distortion...
  1. Each xi is grouped with its nearest centroid
  2. Each centroid must be the center of its group

(Note: the penultimate point is obvious. The last point is not so obvious, but can be verified using some basic calculus: minimize the distortion function by taking partial derivatives and setting them to 0.)

## K-Means

- The previous slide implies that we can improve a suboptimal clustering by either:
  1. Re-assigning each xi to its nearest centroid
  2. Re-computing centroids so they're centers
- No improvement comes from applying either 1 or 2 more than once in succession
- But alternating might be useful
  - In fact, that is the K-means algorithm

## K-Means Algorithm

- Given a dataset...
  1. Select a value for K (how?)
  2. Select initial centroids (how?)
  3. Group data by nearest centroid
  4. Recompute centroids (cluster centers)
  5. If significant change, then goto 3; else stop

## K-Means Animation

- Very good animation here: http://shabal.in/visuals/kmeans/2.html
- Nice animations of the movement of centroids in different cases here: http://www.ccs.neu.edu/home/kenb/db/examples/059.html (near bottom of web page)
- Other?

## K-Means

- Are we assured of an optimal solution? Definitely not
- Why not?
  - For one, initial centroid locations are critical
  - There is a (sensitive) dependence on initial conditions
  - This is a common issue in iterative processes (HMM training is an example)

## K-Means Initialization

- Recall, K is the number of clusters
- How to choose K?
  - No obvious best way to do so
- But K-means is fast, so trial and error may be OK
  - That is, experiment with different K
  - Similar to choosing N in an HMM
- Is there a better way to choose K?
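The numbered steps of the K-means algorithm above can be sketched as follows (a minimal sketch, not a production implementation; it assumes Euclidean distance, random data points as initial centroids, and the mean as the cluster center):

```python
import math
import random

def kmeans(points, K, max_iters=100):
    # Step 2: select initial centroids (here, K random data points)
    centroids = random.sample(points, K)
    for _ in range(max_iters):
        # Step 3: group data by nearest centroid
        clusters = [[] for _ in range(K)]
        for x in points:
            nearest = min(range(K), key=lambda j: math.dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Step 4: recompute centroids as cluster means
        # (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # Step 5: if no change, stop; else goto step 3
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

random.seed(1)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, K=2)
# the two natural groups are recovered (cluster order depends on the init)
print(sorted(sorted(c) for c in clusters))
```

Note the sensitivity to initialization discussed on the slides: on well-separated data like this toy example almost any initialization converges to the same partition, but in general different random starts can yield different (suboptimal) clusterings.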

(Note: but to sensibly choose the best K, we need some way to measure the quality of the resulting clusters.)

## Optimal K?

- Even for trial and error, we need a way to measure the "goodness" of results
- Choosing an optimal K is tricky
- Most intuitive measures will tend to improve for larger K
  - But K too big may overfit the data
- So, when is K big enough?
  - But not too big...

## Schwarz Criterion

- Choose the K that minimizes

        f(K) = distortion_K + λ d K log m

  - Where d is the dimension, m is the number of data points, and λ is ???
- Recall that distortion depends on K
  - Tends to decrease as K increases
  - Essentially, adding a penalty as K increases
- Related to the Bayes Information Criterion (BIC)
  - And some other similar things
- Consider the choice of K in more detail later
- (figure: f(K) plotted against K, not reproduced)

## K-Means Initialization

- How to choose initial centroids?
- Again, no best way to do this
  - Counterexamples to any "best" approach
- Often just choose at random
- Or uniform/maximum spacing
  - Or some variation on this idea
- Other?
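The Schwarz-style criterion above is straightforward to compute once distortion_K has been measured for each candidate K. A sketch (function names are mine; λ is left as a parameter since the slide leaves it unspecified):

```python
import math

def schwarz_f(distortion_K, K, d, m, lam=1.0):
    # f(K) = distortion_K + lam * d * K * log(m)
    return distortion_K + lam * d * K * math.log(m)

def best_K(distortions, d, m, lam=1.0):
    # distortions: dict mapping candidate K -> measured distortion_K
    return min(distortions, key=lambda K: schwarz_f(distortions[K], K, d, m, lam))

# Toy numbers: distortion drops sharply until K = 3, then flattens out,
# so the penalty term d*K*log(m) makes K = 3 the minimizer
print(best_K({1: 100.0, 2: 40.0, 3: 10.0, 4: 9.0, 5: 8.5}, d=2, m=50))  # 3
```

This captures the idea on the slide: the raw distortion keeps improving as K grows, but the penalty term eventually outweighs the small gains.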

## K-Means Initialization

- In practice, we might do the following:
  - Try several different choices of K
  - For each K, test several initial centroids
  - Select the result that is best
- How to measure "best"?
  - We look at that next
- May not be very scientific
  - But often it's good enough

## K-Means Variations

- One variation is K-medoids
  - Centroids must be actual data points
- Fuzzy K-means
  - In K-means, any data point is in one cluster and not in any other
  - In the fuzzy case, a data point can be partly in several different clusters
  - Degree of membership vs distance
- Many other variations...

## Measuring Cluster Quality

- How can we judge clustering results?
  - In general, that is, not just for K-means
- Compare to typical training/scoring
  - Suppose we test a new scoring method
  - E.g., score malware and benign files
  - Compute ROC curves, AUC, etc.
  - Many tools to measure success/accuracy
- Clustering is different (Why? How?)

## Clustering Quality

- Clustering is a "fishing expedition"
  - Not sure what we are looking for
  - Hoping to find structure, data discovery
  - If we knew the answer, there would be no point to clustering
- Might find something that's not there
  - Even random data can be clustered
- Some things to consider on the next slides
  - Relative to the data to be clustered

## Cluster-ability?

- Clustering tendency
  - How suitable is the dataset for clustering?
  - Which dataset below is cluster-friendly? (figure not reproduced)
- We can always apply clustering...
  - ...but expect better results in some cases

## Validation

- External validation
  - Compare clusters based on data labels
  - Similar to the usual training/scoring scenario
  - Good idea if we know something about the data
- Internal validation
  - Determine quality based only on the clusters
  - E.g., spacing between and within clusters
  - Harder to do, but always applicable

## It's All Relative

- Comparing clustering results
  - That is, compare one clustering result with others for the same dataset
- Can be very useful in practice
  - Often, lots of trial and error
- Could enable us to "hill climb" to better clustering results...
  - ...but we still need a way to quantify things

## How Many Clusters?

- Optimal number of clusters?
  - Already mentioned this wrt K-means
  - But what about the general case?
  - I.e., not dependent on the clustering technique
- Can the data tell us how many clusters?
  - Or the topology of the clusters?
- Next, we consider relevant measures

## Internal Validation

- Direct measurement of clusters
  - Might call it "topological" validation
- We'll consider the following:
  - Cluster correlation
  - Similarity matrix
  - Sum of squares error
  - Cohesion and separation
  - Silhouette coefficient

## Correlation Coefficient

- For X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), the correlation coefficient rXY is

        rXY = cov(X, Y) / (Sx Sy)
            = Σ(xi - mean(X))(yi - mean(Y)) / sqrt(Σ(xi - mean(X))^2 Σ(yi - mean(Y))^2)

- Can show -1 ≤ rXY ≤ 1
- If rXY > 0 then positive correlation (and vice versa)
- Magnitude is the strength of the correlation
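The correlation coefficient above, sketched in plain Python (the function name is mine):

```python
import math

def corr(X, Y):
    # Pearson correlation coefficient r_XY
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    sx = math.sqrt(sum((x - mx) ** 2 for x in X))
    sy = math.sqrt(sum((y - my) ** 2 for y in Y))
    return cov / (sx * sy)

print(corr([1, 2, 3], [2, 4, 6]))  # ≈ 1.0  (perfect positive correlation)
print(corr([1, 2, 3], [6, 4, 2]))  # ≈ -1.0 (perfect negative correlation)
```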

## Examples of rXY in 2-d

- (figure: scatter plots illustrating various values of rXY, not reproduced)

## Cluster Correlation

- Given data x1, x2, ..., xm, and clusters, define 2 matrices
- Distance matrix D = {dij}
  - Where dij is the distance between xi and xj
- Adjacency matrix A = {aij}
  - Where aij is 1 if xi and xj are in the same cluster
  - And aij is 0 otherwise
- Now what?

## Cluster Correlation

- Compute the correlation between D and A

- High inverse correlation implies nearby things are clustered together
  - Why inverse?

(Note: why inverse correlation? For the large elements of A (i.e., aij = 1) we want the corresponding distance dij to be relatively small.)
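Putting the last few slides together, here is a sketch of the D-versus-A cluster correlation (names are mine; the off-diagonal entries of the two matrices are flattened into lists and fed to the Pearson correlation, which is redefined here so the block is self-contained):

```python
import math

def corr(X, Y):
    # Pearson correlation coefficient
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    return cov / (math.sqrt(sum((x - mx) ** 2 for x in X)) *
                  math.sqrt(sum((y - my) ** 2 for y in Y)))

def cluster_correlation(points, labels):
    # D: pairwise distances d_ij
    # A: a_ij = 1 if x_i and x_j are in the same cluster, else 0
    D, A = [], []
    m = len(points)
    for i in range(m):
        for j in range(i + 1, m):
            D.append(math.dist(points[i], points[j]))
            A.append(1 if labels[i] == labels[j] else 0)
    return corr(D, A)

# Two tight, well-separated clusters: strong inverse correlation,
# since a_ij = 1 exactly where d_ij is small
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(cluster_correlation(pts, [0, 0, 1, 1]))  # close to -1
```

A clustering that groups faraway points together would instead push the correlation toward 0 or above, which is why a strongly negative value is the good outcome here.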