Data Analysis – Clustering
Prof. Dr. Jose Fernando Rodrigues Junior, ICMC-USP
http://publicationslist.org/junio


TRANSCRIPT

Page 1: Data analysis05 clustering


Data Analysis – Clustering

Prof. Dr. Jose Fernando Rodrigues Junior, ICMC-USP

Page 2: Data analysis05 clustering


What is it about?
Clustering refers to the process of finding groups of points that are in some way “lumped together”.

It is a modality of unsupervised learning, as we do not know ahead of time where the clusters are or what they look like – there is no training!

It is an exploratory attempt to characterize the structure of a dataset.

Page 3: Data analysis05 clustering


But, what is a cluster?
• groups of points that are similar
• groups of points that are close to each other
• groups well-separated one from each other
• contiguous regions of high data point density separated by regions of lower point density

Page 4: Data analysis05 clustering


But, what is a cluster?

Any clusters here? There should not be, since the points were generated uniformly (no two points overlap, yet). Even so, most algorithms would point out some clusters. It is not that there are clusters there; it is only that we do not have enough points yet.

Page 5: Data analysis05 clustering


But, what is a cluster?

Any clusters here? There should not be, since the points were generated uniformly (no two points overlap, yet). Even so, most algorithms would point out some clusters. It is not that there are clusters there; it is only that we do not have enough points yet.

The point here is – although one would find clusters, they definitely do not explain the phenomenon accurately.

Page 6: Data analysis05 clustering


But, what is a cluster?

Yes! Three clusters, I can see them. Distance-based algorithms can do well here. Easy, huh?! No wonder: here we have convex, disjoint, and well-separated groups of points.

Try the next ones!

Page 7: Data analysis05 clustering


But, what is a cluster?

Non-convex clusters – simple distance-based algorithms would have trouble here. A cluster is convex if the line connecting any two of its points lies entirely within the cluster itself.

Page 8: Data analysis05 clustering


But, what is a cluster?

Non-convex clusters – simple distance-based algorithms would have trouble here. A cluster is convex if the line connecting any two of its points lies entirely within the cluster itself.

There are also star-convex clusters: in such a case, the line connecting the spatial center of the cluster to any other point lies entirely within the cluster.

Page 9: Data analysis05 clustering


But, what is a cluster?

Intersecting clusters – quite a challenge!

Page 10: Data analysis05 clustering


But, what is a cluster?

No general clustering algorithm can solve this. The clustering is given by global properties observed in the points – distance- or neighbor-based algorithms would yield a single cluster.

Page 11: Data analysis05 clustering


But, what is a cluster?

No general clustering algorithm can solve this. The clustering is given by global properties observed in the points – distance- or neighbor-based algorithms would yield a single cluster.

In this case, any algorithm that considers a single point (or a single pair of points) at a time runs into a problem: to determine cluster membership, we need the properties of the whole cluster; but to determine the properties (vertical, horizontal, and pairwise orthogonal) of the cluster, we must first assign points to clusters.

Page 12: Data analysis05 clustering


But, what is a cluster?
To handle such situations, we would need to perform some kind of global structure analysis – a task our minds are incredibly good at (which is why we tend to think of clusters this way) but that we have a hard time teaching computers to do.

For problems in two dimensions, digital image processing has developed methods to recognize and extract certain features (such as edge detection).

But general clustering methods deal only with local properties and therefore can’t handle problems such as these.

Page 13: Data analysis05 clustering


But, what is a cluster?
If we return to our candidate definitions of a cluster, we can verify that none of them survives the possibilities just presented – try it!

• groups of points that are similar
• groups of points that are close to each other
• groups well-separated one from each other
• contiguous regions of high data point density separated by regions of lower point density

Page 14: Data analysis05 clustering


But, what is a cluster?
If we return to our candidate definitions of a cluster, we can verify that none of them survives the possibilities just presented – try it!

• groups of points that are similar
• groups of points that are close to each other
• groups of points well-separated one from each other
• contiguous regions of high data point density separated by regions of lower point density

So this is it.

• There is no mathematical, nor universal, definition of a cluster

• Rather, we have our intuition, and it can be quite useful provided we have a good comprehension of the data properties – structural, statistical, and domain-related

• Having, as much as possible, well-defined goals is also required

• Just as with any other data analysis approach, do not try to use it as a magic black box – doing so will fail with high probability!

Page 15: Data analysis05 clustering


Distances
Clustering does not actually require data points to be embedded in a geometric space: all that is required is a distance or (equivalently) a similarity measure for any pair of points.

This makes it possible to perform clustering on a set of strings, for example.

However, if the data points have the properties of a vector space, then we can develop more efficient algorithms that exploit these properties.

Page 16: Data analysis05 clustering


Distances – what is it?
A distance is any function d(x, y) that takes two points and returns a scalar value that is a measure of how different these points are: the more different, the larger the distance.

From a distance function we can obtain a similarity function, for example:
  s(x, y) = 1 − d(x, y), for 0 ≤ d(x, y) ≤ 1
  s(x, y) = 1 / d(x, y)
  s(x, y) = e^(−d(x, y))

For some problems, a particular distance measure will present itself naturally – if the data points are points in space, then we will most likely employ the Euclidean distance or a measure similar to it; for other problems, we have more freedom to define our own metric.
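A minimal sketch, not from the slides, of the three distance-to-similarity conversions listed above; the function names are illustrative, not a library API.

```python
import math

def similarity_complement(d):
    """s = 1 - d, valid when 0 <= d <= 1."""
    return 1.0 - d

def similarity_inverse(d):
    """s = 1 / d, for d > 0 (grows without bound as d approaches 0)."""
    return 1.0 / d

def similarity_exponential(d):
    """s = exp(-d), maps any d >= 0 into (0, 1]."""
    return math.exp(-d)

if __name__ == "__main__":
    for d in (0.1, 0.5, 2.0):
        print(d, similarity_complement(min(d, 1.0)),
              similarity_inverse(d), similarity_exponential(d))
```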

Page 17: Data analysis05 clustering


Distances – metric distances
There are certain properties that a distance (or similarity) function should have. Mathematicians have developed a set of properties that a function must possess to be considered a metric (or distance) in a mathematical sense:

• d(x, y) ≥ 0
• d(x, y) = 0 if and only if x = y
• d(x, y) = d(y, x)
• d(x, y) + d(y, z) ≥ d(x, z)

These conditions are not necessarily fulfilled in practice. A funny example of an asymmetric distance occurs if you ask everyone in a group of people how much they like every other member of the group and then use the responses to construct a distance measure: it is not at all guaranteed that the feelings of person A for person B are requited by B.

Page 18: Data analysis05 clustering


Distances – metric distances
There are certain properties that a distance (or similarity) function should have. Mathematicians have developed a set of properties that a function must possess to be considered a metric (or distance) in a mathematical sense:

• d(x, y) ≥ 0
• d(x, y) = 0 if and only if x = y
• d(x, y) = d(y, x)
• d(x, y) + d(y, z) ≥ d(x, z)

These conditions are not necessarily fulfilled in practice. A funny example of an asymmetric distance occurs if you ask everyone in a group of people how much they like every other member of the group and then use the responses to construct a distance measure: it is not at all guaranteed that the feelings of person A for person B are requited by B.

For technical reasons, the symmetry property is usually highly desirable. You can always construct a symmetric distance function from an asymmetric one:

  dS(x, y) = (d(x, y) + d(y, x)) / 2
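A small sketch of the symmetrization above, using the liking example from the slide; the scores and helper names are made up for illustration.

```python
# Build a symmetric distance by averaging both directions of an asymmetric one.
def symmetrized(d, x, y):
    """d_S(x, y) = (d(x, y) + d(y, x)) / 2"""
    return (d(x, y) + d(y, x)) / 2.0

liking = {("A", "B"): 0.9, ("B", "A"): 0.2}   # hypothetical, asymmetric scores

def dislike(x, y):
    """Turn a liking score into a dissimilarity."""
    return 1.0 - liking.get((x, y), 0.0)

print(dislike("A", "B"), dislike("B", "A"))   # asymmetric
print(symmetrized(dislike, "A", "B"))         # symmetric average of both directions
```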

Page 19: Data analysis05 clustering


Distances – common distances
Commonly used distance and similarity measures for numeric data:

Page 20: Data analysis05 clustering


Distances – common distances
The Manhattan, Euclidean, Maximum, and Minkowski distances all have similar properties; the choice among them may depend on empirical testing or on subtle details of the data domain.

[Figure: formulas of the Minkowski (Lp) and Maximum (L-infinity) metrics]
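A minimal sketch of the distances named above for two numeric vectors; the helper names are ours, not a library API.

```python
import math

def manhattan(x, y):                    # L1 metric
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):                    # L2 metric
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def maximum(x, y):                      # L-infinity metric
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):                 # general Lp metric
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y), euclidean(x, y), maximum(x, y), minkowski(x, y, 3))
```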

Page 21: Data analysis05 clustering


Distances – correlation-based
Correlation-based measures are used if the data is numeric but not mixable (so that it does not make sense to add a random fraction of one data set to a random fraction of a different data set), as, for example, in time series.

The normalized dot product of two points is the cosine of the angle that the two vectors make with each other: if they are perfectly aligned, then the angle is 0 and the cosine (and the correlation) is 1; if they are at right angles to each other, the cosine is 0.

The only difference between the dot product and the correlation coefficient is that, for the second, we first center both data points by subtracting their respective means.

By construction, both the normalized dot product and the correlation coefficient always fall in the interval [−1, 1].
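A small sketch contrasting the two measures above: the cosine (normalized dot product) and the correlation coefficient, which is the same computation applied to mean-centered vectors. Plain Python, no library API assumed.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def correlation(x, y):
    # Same as cosine, but on mean-centered data.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.1]
print(cosine(x, y), correlation(x, y))   # both close to 1 for aligned series
```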

Page 22: Data analysis05 clustering


Distances – binary and sparse
If the data is categorical, then we can count the number of features that do not agree in both data points (i.e., the number of mismatched features); this is the Hamming distance.

As an example, imagine a patient’s health record: each possible medical condition constitutes a feature, and we want to know whether the patient has ever suffered from it.

In situations where the features are categorical, binary, and sparse (just a few are On), we may be more interested in matches between features that are On than in those that are Off; this leads us to the Jaccard coefficient s: the number of matches between features that are On for both points, divided by the number of features that are On in at least one of the data points.

The Jaccard coefficient is a similarity measure; the corresponding distance function is the Jaccard distance dJ = 1 − sJ.

Page 23: Data analysis05 clustering


Distances – binary and sparse
If the data is categorical, then we can count the number of features that do not agree in both data points (i.e., the number of mismatched features); this is the Hamming distance.

As an example, imagine a patient’s health record: each possible medical condition constitutes a feature, and we want to know whether the patient has ever suffered from it.

In situations where the features are categorical, binary, and sparse (just a few are On), we may be more interested in matches between features that are On than in those that are Off; this leads us to the Jaccard coefficient s: the number of matches between features that are On for both points, divided by the number of features that are On in at least one of the data points.

The Jaccard coefficient is a similarity measure; the corresponding distance function is the Jaccard distance: dJ = 1 − sJ.

As an example, imagine graph data: the similarity of two vertices is given by how many neighbors they have in common (On) – a representation that is usually sparse, since only a few vertices are neighbors of any given vertex.
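A minimal sketch of the two measures above for binary feature vectors (True/False flags); the function names are illustrative.

```python
def hamming(x, y):
    """Number of positions where the two feature vectors disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard_similarity(x, y):
    """Features On in both points, divided by features On in at least one."""
    both_on = sum(1 for a, b in zip(x, y) if a and b)
    any_on = sum(1 for a, b in zip(x, y) if a or b)
    return both_on / any_on if any_on else 1.0

def jaccard_distance(x, y):
    return 1.0 - jaccard_similarity(x, y)

p1 = [True, False, False, True, False]   # e.g., conditions in a health record
p2 = [True, False, True, False, False]
print(hamming(p1, p2), jaccard_similarity(p1, p2), jaccard_distance(p1, p2))
```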

Page 24: Data analysis05 clustering


Distances – strings
If we are dealing with many strings that are rather similar to each other (distorted through typos, for instance), then we can use a more detailed measure of the difference between them – namely the edit, or Levenshtein, distance. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other.

Another approach is to find the length of the longest common subsequence; this metric is often used for gene sequence analysis in computational biology.
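A compact sketch of the Levenshtein distance described above, using the standard dynamic-programming recurrence; not from the slides.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```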

Page 25: Data analysis05 clustering


Distances – strings
If we are dealing with many strings that are rather similar to each other (distorted through typos, for instance), then we can use a more detailed measure of the difference between them – namely the edit, or Levenshtein, distance. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other.

Another approach is to find the length of the longest common subsequence; this metric is often used for gene sequence analysis in computational biology.

The best distance measure to use does not follow automatically from the data type; rather, it depends on the semantics of the data – or, more precisely, on the semantics that you care about for your current analysis!

In some cases, a simple metric that only calculates the difference in string length may be perfectly sufficient. In another case, you might want to use the Hamming distance.

If you really care about the details of otherwise similar strings, the Levenshtein distance is most appropriate. You might even want to calculate how often each letter appears in a string and then base your comparison on that.

It all depends on what the data means and on what aspect of it you are interested in at the moment (which may also change as the analysis progresses).

Similar considerations apply everywhere – there are no “cookbook” rules.

Page 26: Data analysis05 clustering


Clustering methods
Different algorithms are suitable for different kinds of problems – depending, for example, on the shape and structure of the clusters.

Some require vector-like data, whereas others require only a distance function.

Different algorithms tend to be misled by different kinds of pitfalls, and they all have different performance (i.e., computational complexity) characteristics.

There are three main categories of clustering algorithms: center seekers, tree builders, and neighborhood growers – I said three main, not only three (check the paper “A Survey of Clustering Data Mining Techniques” by Pavel Berkhin).

Page 27: Data analysis05 clustering


Clustering methods – k-means
One of the most popular clustering methods is the k-means algorithm; it requires the expected number of clusters k as input and works in an iterative scheme to search for the correct center of each cluster.

The main idea is to calculate the position of each cluster’s center (or centroid) from the positions of the points belonging to the cluster and then to assign points to their nearest centroid – this process is repeated until sufficient convergence is achieved.

The algorithm is as follows:

  choose initial positions for the cluster centroids
  repeat:
      for each point:
          calculate its distance from each cluster centroid
          assign the point to the nearest cluster
      recalculate the positions of the cluster centroids
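A minimal NumPy sketch of the loop above, not a library implementation; it assumes numeric vector data in the rows of X, uses random data points as initial centroids, and stops when the assignment no longer changes.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Distance from every point to every centroid, then nearest-centroid assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the center of mass of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)
```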

Page 28: Data analysis05 clustering


Clustering methods – k-means
The k-means algorithm is nondeterministic: a different choice of starting values may result in a different assignment of points to clusters; for this reason, it is customary to run the k-means algorithm several times and then compare the results.

If you have previous knowledge of likely positions for the cluster centers, you can use it to precondition the algorithm; otherwise, choose random data points as initial values.

What makes this algorithm efficient is that you don’t have to search the existing data points to find one that would make a good centroid – instead, you are free to construct a new centroid position; this is usually done by calculating the cluster’s center of mass, i.e., the mean position of the points currently assigned to the cluster.

Page 29: Data analysis05 clustering


Clustering methods – k-means
The k-means algorithm is nondeterministic: a different choice of starting values may result in a different assignment of points to clusters; for this reason, it is customary to run the k-means algorithm several times and then compare the results.

If you have previous knowledge of likely positions for the cluster centers, you can use it to precondition the algorithm; otherwise, choose random data points as initial values.

What makes this algorithm efficient is that you don’t have to search the existing data points to find one that would make a good centroid – instead, you are free to construct a new centroid position; this is usually done by calculating the cluster’s center of mass, i.e., the mean position of the points currently assigned to the cluster.

If we are using categorical data, then the k-means algorithm cannot be used (one cannot calculate the center of mass); in this case we must use the k-medoids algorithm.

The only difference is that, instead of calculating a new centroid, it is necessary to search all the points in the cluster to find the data point that has the smallest average distance to all other points in its cluster.

For this reason, the k-medoids algorithm is O(n²), whereas the k-means algorithm is O(k·n), where k is the number of clusters.

For performance, it is possible to run k-medoids on a sample of the dataset to get an idea of the cluster centers and then run it on the entire dataset.
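A short sketch of the k-medoids difference described above: each cluster’s representative is the member point with the smallest average distance to all other members, an O(m²) search per cluster. The Euclidean distance is used here only for illustration; with categorical data any suitable distance would take its place.

```python
import numpy as np

def medoid(points):
    """Return the member of `points` with the smallest average distance to the rest."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.mean(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
print(medoid(cluster))   # the medoid is always an actual data point, never a synthetic mean
```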

Page 30: Data analysis05 clustering


Clustering methods – k-means
Despite its cheap-and-cheerful appearance, the k-means algorithm works surprisingly well. It is pretty fast and relatively robust. Convergence is usually quick. Because the algorithm is simple and highly intuitive, it is easy to augment or extend it – for example, to incorporate points with different weights. You might also want to experiment with different ways to calculate the centroid, possibly using the median position rather than the mean, and so on.

In summary:
• The k-means algorithm and its variants work best for globular (at least star-convex) clusters; the results will be meaningless for clusters with complicated shapes and for nested clusters
• The expected number of clusters is required as an input; if this number is not known, it will be necessary to repeat the algorithm with different values and compare the results
• The algorithm is iterative and nondeterministic; the specific outcome may depend on the choice of starting values
• The k-means algorithm requires vector data; use the k-medoids algorithm for categorical data
• The algorithm can be misled if there are clusters of highly different sizes or densities
• The k-means algorithm is linear in the number of data points; the k-medoids algorithm is quadratic in the number of points

Page 31: Data analysis05 clustering


Clustering methods – DBSCAN
Neighborhood growers work by connecting points that are “sufficiently close” to each other to form a cluster and then keep doing so until all points have been classified.

They are based on the idea (definition) of a cluster as a region of high density, and they make no assumptions about the overall shape of the cluster.

They are more robust than k-means variants with respect to the structure of the clusters.

Page 32: Data analysis05 clustering


Clustering methods – DBSCAN
The DBSCAN algorithm is an example of a neighborhood grower.

It is based on two quantities:
• the minimum density accepted for the points that define the cluster
• the size of the region over which we expect the minimum density to be verified

In practice, the algorithm asks for:
• the neighborhood radius r
• the minimum number of points n that we expect to find within the neighborhood of each point

Page 33: Data analysis05 clustering


Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core, and edge points:
• A noise point is a point that has fewer than n points in its neighborhood of radius r; such a point does not belong to any cluster – it is background data
• A core point has more than n neighbors
• An edge point is a point that has fewer neighbors than required for a core point but that is itself the neighbor of a core point – the algorithm discards noise points and concentrates on core points

Whenever the algorithm finds a core point, it assigns a cluster label to that point and then continues to add all its neighbors, and their neighbors recursively, to the cluster until all points have been classified.
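A hedged usage sketch, not from the slides: scikit-learn’s DBSCAN exposes the two parameters discussed above as eps (our radius r) and min_samples (our neighbor count n). The two-moons data below is synthetic, just to exercise non-convex clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi, 200)
# Two interlocking half-circles with a little Gaussian noise.
moon1 = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (200, 2))
moon2 = np.column_stack([1 - np.cos(theta), 0.5 - np.sin(theta)]) + rng.normal(0, 0.05, (200, 2))
X = np.vstack([moon1, moon2])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points (expected: two clusters here)
```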

Page 34: Data analysis05 clustering


Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core, and edge points:
• A noise point is a point that has fewer than n points in its neighborhood of radius r; such a point does not belong to any cluster
• A core point has more than n neighbors
• An edge point is a point that has fewer neighbors than required for a core point but that is itself the neighbor of a core point – the algorithm discards noise points and concentrates on core points

Whenever the algorithm finds a core point, it assigns a cluster label to that point and then continues to add all its neighbors, and their neighbors recursively, to the cluster until all points have been classified.

Finally, the basic algorithm lends itself to elegant recursive implementations, but keep in mind that the recursion will not unwind until the current cluster is complete. This means that, in the worst case (of a single connected cluster), you will end up putting the entire data set onto the stack!

Page 35: Data analysis05 clustering


Clustering methods – DBSCAN
DBSCAN is sensitive to the choice of parameters.

For example, if a data set contains several clusters with widely varying densities, then a single set of parameters may not be sufficient to classify all of the clusters.

A possible workaround is to use k-means first to identify cluster candidates and then to extract statistics that will help parametrize DBSCAN.

The computational complexity of DBSCAN is O(n²), which can be ameliorated by indexing structures able to quickly find the neighbors of each point.

Page 36: Data analysis05 clustering


Clustering methods – tree builders
Another way to find clusters is by successively combining clusters that are “close” to each other into a larger cluster until only a single cluster remains; this approach is known as agglomerative hierarchical clustering, and it leads to a treelike hierarchy of clusters.

The distance between clusters is given with respect to representative points within each cluster; the possibilities are:
• Minimum or single link: the two points, one from each cluster, that are closest to each other; it handles thinly connected clusters with complicated shapes, but it is sensitive to noise
• Maximum or complete link: considers the points farthest away from each other; it favors compact, globular clusters
• Average: considers the average distance between all pairs of points
• Centroid: considers the centroids of each cluster
• Ward’s method: combines clusters whose coherence is higher; coherence can be the average distance of all pairs, for example
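A brief sketch of agglomerative (tree-building) clustering with SciPy; the method argument corresponds to the linkage choices listed above ('single', 'complete', 'average', 'centroid', 'ward'). The data is synthetic, for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in ([0, 0], [3, 3], [0, 3])])

Z = linkage(X, method='ward')                      # the full cluster tree (dendrogram data)
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 flat clusters
print(sorted(set(labels)))                         # three cluster ids
```

The tree Z can also be drawn with scipy.cluster.hierarchy.dendrogram to inspect the hierarchy described on the next slide.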

Page 37: Data analysis05 clustering


Clustering methods – tree builders
The result of hierarchical clustering is not actually a set of clusters; instead, we obtain a treelike structure that contains the individual data points at the leaf nodes – this structure can be represented graphically in a dendrogram.

Tree builder algorithms are expensive, on the order of O(n³).

One outstanding feature of hierarchical clustering is that it does more than produce a flat list of clusters; it also shows their relationships in an explicit way.

Tree builders can benefit from algorithms that are center seekers or neighborhood growers.

Page 38: Data analysis05 clustering


Pre-processing
The core algorithm for grouping data points into clusters is usually only part (though the most important one) of the whole strategy.

Some data sets may require some cleanup or normalization before they are suitable for clustering: that is the first topic in this section.

For example, look at the two plots below and answer: which one has well-defined clusters?

Page 39: Data analysis05 clustering


Pre-processing
For example, look at the two plots below and answer: which one has well-defined clusters?

Well, as a matter of fact, both plots show the same dataset, but with different aspect ratios.

The same applies to datasets whose attributes span very different ranges – in such cases, it is necessary to normalize the data.

Problems like these are not observed with correlation-based distances.

Page 40: Data analysis05 clustering


Pre-processing
The simplest normalization can be achieved by:

  x’ = (x − xmin) / (xmax − xmin)

Or, if the data is reasonably Gaussian, it is possible to use the z-score normalization:

  x’ = (x − xmean) / xstddev

But first, use an interquartile range (IQR) analysis to get rid of outliers.

Actually, normalization is very sensitive to outliers and to distributions that are too skewed – for these cases, there are many other normalization techniques; check, for instance:

http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm
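A small sketch of the two normalizations above, plus a simple IQR filter to drop outliers before computing the statistics; applied to a one-dimensional NumPy array for illustration.

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def zscore(x):
    return (x - x.mean()) / x.std()

def iqr_filter(x):
    """Keep only values within 1.5 * IQR of the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])   # 100.0 is an outlier
print(minmax(x))                             # squashed toward 0 by the outlier
print(zscore(iqr_filter(x)))                 # more reasonable after filtering
```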

Page 41: Data analysis05 clustering


Pre-processing
The simplest normalization can be achieved by:

  x’ = (x − xmin) / (xmax − xmin)

Or, if the data is reasonably Gaussian, it is possible to use the z-score normalization:

  x’ = (x − xmean) / xstddev

But first, use an interquartile range (IQR) analysis to get rid of outliers.

Actually, normalization is very sensitive to outliers and to distributions that are too skewed – for these cases, there are many other normalization techniques; check, for instance:

http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm


• Normalization by Mean
• Normalization by Trimmed Mean
• Normalization by Percentile
• Scale between 0 and 1
• Subtract the Mean
• Subtract the Median
• Normalization by Signed Ratio
• Normalization by Log Ratio
• Normalization by Log Ratio in Standard Deviation Units
• Z-score Calculation
• Normalization by Standard Deviation

Also, the Mahalanobis distance is less susceptible to normalization issues

Page 42: Data analysis05 clustering


Post-processing (cluster evaluation)
It is also necessary to inspect the results of every clustering algorithm in order to validate and characterize the clusters that have been found.

Given a set of clusters whose centroids are known, we can think of two metrics:
• Mass: the number of points in the cluster
• Radius: the standard deviation of the distances of all points in relation to the center of a given cluster; for two dimensions, with n points and center (xc, yc), we would have:

  r² = (1/n) ∑ᵢ [(xc − xi)² + (yc − yi)²]

We can also define the density of a cluster as:

  density = mass / radius
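A sketch of the three per-cluster statistics defined above for 2-D points, using the root-mean-square distance to the center as the radius.

```python
import numpy as np

def cluster_stats(points):
    center = points.mean(axis=0)                                    # (xc, yc)
    mass = len(points)
    radius = np.sqrt(np.mean(np.sum((points - center) ** 2, axis=1)))
    density = mass / radius if radius > 0 else float("inf")
    return mass, radius, density

pts = np.random.default_rng(2).normal([1.0, 2.0], 0.5, size=(100, 2))
print(cluster_stats(pts))
```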

Page 43: Data analysis05 clustering


Post-processing (cluster evaluation)
Besides density, there are:
• Cohesion: the average distance between all points in a cluster; the smaller, the more compact
• Separation: the average distance between all points in one cluster and all the points in another cluster – if we know the centroids, we can use them to simplify the calculation

For a set of clusters, we can calculate the average cohesion and separation over all clusters and get an idea of the overall quality.

If a data set can be clearly grouped into clusters, then we expect the distance between the clusters to be large compared to the radii of the clusters; therefore, we can think of an interesting metric based on cohesion and separation:

  cluster_quality = separation / cohesion

Page 44: Data analysis05 clustering


Post-processing (cluster evaluation)
One of the most used metrics for clustering is the silhouette coefficient, which for a single point i is given by:

  Si = (bi − ai) / max(ai, bi)

where ai is the average distance from point i to all other points in its cluster (this is point i’s cohesion), and bi is the smallest average distance from point i to all the points in each of the other clusters (this is point i’s separation from the closest other cluster).

The numerator is a measure of the “empty space” between clusters; the denominator is the larger of the cluster radius and the distance between clusters.

Next, average the silhouette over all points in each cluster – this is the cluster’s silhouette; average it over all clusters – this is the clustering’s silhouette.

The silhouette coefficient ranges from −1 to 1; negative values indicate that the cluster radius is greater than the distance between clusters, so that clusters overlap; this suggests poor clustering. Large values of S suggest good clustering.
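A brief usage sketch, not from the slides: scikit-learn implements the per-point silhouette (silhouette_samples) and its overall average (silhouette_score) for a given data set and labeling. The blob data below is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in ([0, 0], [4, 0], [2, 4])])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))            # clustering-level silhouette
print(silhouette_samples(X, labels)[:5])      # per-point values, in [-1, 1]
```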

Page 45: Data analysis05 clustering


Post-processing (cluster evaluation)
One of the most used metrics for clustering is the silhouette coefficient, which for a single point i is given by:

  Si = (bi − ai) / max(ai, bi)

where ai is the average distance from point i to all other points in its cluster (this is point i’s cohesion), and bi is the smallest average distance from point i to all the points in each of the other clusters (this is point i’s separation from the closest other cluster).

The numerator is a measure of the “empty space” between clusters; the denominator is the larger of the cluster radius and the distance between clusters.

Next, average the silhouette over all points in each cluster – this is the cluster’s silhouette; average it over all clusters – this is the clustering’s silhouette.

The silhouette coefficient ranges from −1 to 1; negative values indicate that the cluster radius is greater than the distance between clusters, so that clusters overlap; this suggests poor clustering. Large values of S suggest good clustering.

The silhouette can be used to toss background points out of the clustering process, that is, points that markedly exceed the average cohesion within a given cluster.

This process can be applied iteratively – once some points are tossed out, the clustering can be repeated and will hopefully produce better results; and so on.

Page 46: Data analysis05 clustering


Post-processing (cluster evaluation)
The clustering silhouette is very important: not only does it tell us the quality of a clustering, it can also tell us which clustering is the correct one; for example, consider the following dataset:

Page 47: Data analysis05 clustering


Post-processing (cluster evaluation)
The clustering silhouette is very important: not only does it tell us the quality of a clustering, it can also tell us which clustering is the correct one; for example, consider the following dataset:

Clearly we have clusters, but how many? Visually, we can make out from 6 to 8 clusters, depending on how we look.

What to do?

Page 48: Data analysis05 clustering


Post-processing (cluster evaluation)
One way to solve this problem is to use the k-means algorithm and calculate the silhouette for different numbers of clusters. In our example, we would get the following curve:

[Plot: clustering silhouette versus number of clusters, with peaks at 6 and 7]

Page 49: Data analysis05 clustering


Post-processing (cluster evaluation)
One way to solve this problem is to use the k-means algorithm and calculate the silhouette for different numbers of clusters. In our example, we would get the following curve:

[Plot: clustering silhouette versus number of clusters, with peaks at 6 and 7]

The plot indicates that 6 or 7 clusters are acceptable answers; the next stage is to consider the data characteristics in order to decide which is the best answer.
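A sketch of the procedure above: run k-means for a range of k and report the average silhouette; the best candidates for k are the values where the silhouette peaks. The six-blob data is synthetic, so the silhouette should peak near k = 6.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
centers = [(0, 0), (5, 0), (0, 5), (5, 5), (10, 0), (10, 5)]
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])

for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # look for the peak over k
```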

Page 50: Data analysis05 clustering


Warning
Just like any other analytical technique, clustering can lead you to unproductive circumstances (a waste of time) if not used with caution; some points of concern:
• Most algorithms depend on heuristic parameters, and it may take hours to find the most appropriate values
• The algorithms also lend themselves to modifications that, although they may sound intuitively right, take you nowhere
• It is quite possible that, although you are looking for them, the data has no clusters at all; this is not such an improbable circumstance, because clustering algorithms are usually treated as black boxes – be circumspect, and pay attention to the evidence!
• Despite the fact that there are evaluation methods and visualization tools, the clustering result may still be flawed; remember, there is no formal theory behind cluster concepts

Finally, this review is mostly addressed to practitioners, not academics; for the latter, there are many other aspects that must be considered – for more details, please check the paper “A Survey of Clustering Data Mining Techniques” by Pavel Berkhin, among other sources.

Page 51: Data analysis05 clustering


References
• Philipp K. Janert, Data Analysis with Open Source Tools, O’Reilly, 2010
• Wikipedia, http://en.wikipedia.org
• Wolfram MathWorld, http://mathworld.wolfram.com/