cluster analysis

Cluster Analysis
Amandeep Singh, 30 June 2014

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis, or numerical taxonomy.

Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, in order to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.

The k-means algorithm

The k-means algorithm is perhaps the most often used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. The goal is to minimize the differences within each cluster and maximize the differences between clusters. This involves assigning each of the n examples to one of the k clusters, where k is a number that has been defined ahead of time.

The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. The points are chosen by selecting k random examples from the training dataset. In the running example, because we hope to identify three clusters, k = 3 points are selected.
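The initialization step described above, picking k random examples as starting centers, can be sketched roughly as follows. This is only an illustration: the function name `choose_initial_centers`, the sample points, and the `seed` parameter are assumptions made for the example and do not come from the slides.

```python
import random


def choose_initial_centers(examples, k, seed=None):
    """Pick k distinct examples at random to act as the starting cluster centers.

    `seed` is a hypothetical convenience parameter so the choice can be repeated.
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, so no example is picked twice.
    return rng.sample(examples, k)


# Hypothetical two-feature dataset, invented for illustration only.
examples = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]

centers = choose_initial_centers(examples, k=3, seed=42)
print(centers)
```

Fixing the seed only makes the draw repeatable; a different seed will generally pick different starting centers.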
Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is as follows:

$$\operatorname{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

The three cluster centers partition the examples into three segments labeled Cluster A, B, and C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. A Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all three boundaries meet is equidistant from all three cluster centers.

Once the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the mean value of the points currently assigned to that cluster.

Because the cluster boundaries have been adjusted according to the repositioned centers, Cluster A is able to claim an additional example from Cluster B (indicated by an arrow).

Two more points have been reassigned from Cluster B to Cluster A during this phase, as they are now closer to the centroid for A than B. This leads to another update.

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm can be sensitive to randomly chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected.
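The Euclidean distance formula and the assignment and update phases described earlier can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names, the toy data, and the stopping rule (stop once no example changes cluster) are assumptions made for illustration, as is the choice to leave an empty cluster's center where it was.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two examples with the same number of features."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def assign(examples, centers):
    """Assignment phase: give each example the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
        for x in examples
    ]


def update(examples, labels, centers):
    """Update phase: move each center to the mean (centroid) of its assigned points."""
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [x for x, lab in zip(examples, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(old_center)
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers


def kmeans(examples, centers, max_iter=100):
    """Alternate update and assignment until no example changes cluster."""
    labels = assign(examples, centers)
    for _ in range(max_iter):
        centers = update(examples, labels, centers)
        new_labels = assign(examples, centers)
        if new_labels == labels:  # assumed stopping rule
            break
        labels = new_labels
    return labels, centers


# Hypothetical data: two loose groups of three points each.
examples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = kmeans(examples, centers=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centers)
```

Passing a different list of starting centers to `kmeans` is a simple way to explore the sensitivity to initial centers noted above; on less cleanly separated data, the resulting partition may change.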
Ideally, you will have some a priori knowledge (that is, a prior belief) about the true groupings, and you can begin applying k-means using this information. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited.

Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Similarly, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. The homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters. Because you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k such that there are diminishing returns beyond that point. This value of k is known as the elbow point, because the plot of homogeneity or heterogeneity against k bends there like an elbow.
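As a rough illustration of the elbow method, the sketch below runs a small k-means for several values of k and records a heterogeneity score for each. The slides do not say which measure to use; the within-cluster sum of squared distances is assumed here as one common choice. The function names, the toy data, and the use of several random restarts per k are likewise assumptions for the example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two examples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(examples, k, seed=0, max_iter=100):
    """Small k-means: random starting examples, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(examples, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in examples
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(examples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def within_ss(examples, labels, centers):
    """Heterogeneity score: total squared distance of examples to their centers."""
    return sum(sq_dist(x, centers[lab]) for x, lab in zip(examples, labels))


def elbow_curve(examples, k_values, restarts=10):
    """For each k, keep the lowest score found over a few random restarts."""
    return {
        k: min(within_ss(examples, *kmeans(examples, k, seed=s)) for s in range(restarts))
        for k in k_values
    }


# Hypothetical data: three loose groups of four points each.
examples = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
    (1.0, 8.0), (1.0, 9.0), (2.0, 8.0), (2.0, 9.0),
]

curve = elbow_curve(examples, range(1, 7))
for k, score in curve.items():
    print(k, round(score, 3))
```

Plotting these scores against k and looking for the value after which the curve flattens is one way to locate the elbow point. The sketch only produces the numbers and leaves that judgment to the reader.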
