Chapter 10. Cluster Analysis: Basic Concepts and Methods


Page 1

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 2

What is Cluster Analysis?

- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Page 3

Applications of Cluster Analysis

- Data reduction
  - Summarization: preprocessing for regression, PCA, classification, and association analysis
  - Compression: image processing, e.g., vector quantization
- Hypothesis generation and testing
- Prediction based on groups
  - Cluster, then find characteristics/patterns for each group
- Finding K-nearest neighbors
  - Localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Page 4

Clustering: Application Examples

- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
- Economic science: market research

Page 5

Basic Steps to Develop a Clustering Task

- Feature selection
  - Select information concerning the task of interest
  - Minimal information redundancy
- Proximity measure
  - Similarity of two feature vectors
- Clustering criterion
  - Expressed via a cost function or some rules
- Clustering algorithms
  - Choice of algorithms
- Validation of the results
  - Validation test (also, clustering tendency test)
- Interpretation of the results
  - Integration with applications

Page 6

Quality: What Is Good Clustering?

- A good clustering method will produce high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Page 7

Measure the Quality of Clustering

- Dissimilarity/similarity metric
  - Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  - The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  - Weights should be associated with different variables based on applications and data semantics
- Quality of clustering
  - There is usually a separate "quality" function that measures the "goodness" of a cluster
  - It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Page 8

Considerations for Cluster Analysis

- Partitioning criteria
  - Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters
  - Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure
  - Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space
  - Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

Page 9

Requirements and Challenges

- Scalability
  - Clustering all the data instead of only samples
- Ability to deal with different types of attributes
  - Numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering
  - User may give inputs on constraints
  - Use domain knowledge to determine input parameters
- Interpretability and usability
- Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality

Page 10

Major Clustering Approaches (I)

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Page 11

Major Clustering Approaches (II)

- Model-based:
  - A model is hypothesized for each of the clusters, and the aim is to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus

Page 12

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 13

Partitioning Algorithms: Basic Concept

- Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

    E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

Page 14

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps (a code sketch follows below):
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
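A minimal NumPy sketch of this loop (Lloyd's algorithm); the function name, the random initialization, and the toy data are illustrative choices, not from the slides:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Lloyd's k-means on an (n, d) array X; returns (centroids, labels)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
        for _ in range(max_iter):
            # Assignment step: each object goes to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):  # assignments have stabilized
                break
            centroids = new_centroids
        return centroids, labels

    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    centroids, labels = kmeans(X, k=2)  # two obvious blobs -> two clusters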

Page 15

An Example of K-Means Clustering

(Figure: with K = 2, the initial data set is arbitrarily partitioned into two groups; the cluster centroids are then updated and objects reassigned, looping as needed.)

- Partition objects into k nonempty subsets
- Repeat
  - Compute the centroid (i.e., mean point) for each partition
  - Assign each object to the cluster of its nearest centroid
- Until no change

Page 16

Comments on the K-Means Method

- Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
  - Compare: PAM: O(k(n-k)^2); CLARA: O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum
- Weakness
  - Applicable only to objects in a continuous n-dimensional space
    - Use the k-modes method for categorical data
    - In comparison, k-medoids can be applied to a wide range of data
  - Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Page 17

Variations of the K-Means Method

- Most of the variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies for calculating cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - For a mixture of categorical and numerical data: the k-prototype method

Page 18

What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

(Figure: two plots on a 0-10 grid contrasting a mean-based centroid, pulled away by an outlier, with a medoid.)

Page 19

The K-Medoids Clustering Method

- K-medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
- Efficiency improvements on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling

Page 20

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 21

Hierarchical Clustering

- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

(Figure: five objects a-e. Agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging {a, b}, then {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) runs the same steps in reverse.)

Page 22

AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., Splus
- Uses the single-link method and the dissimilarity matrix
- Merges nodes that have the least dissimilarity
- Goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster

(Figure: three scatter plots on a 0-10 grid showing clusters being progressively merged.)

Page 23

Dendrogram: Shows How Clusters Are Merged

- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster
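A minimal SciPy sketch of agglomerative (single-link) clustering and a dendrogram cut; the toy points and the cut height are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])
    Z = linkage(X, method="single")  # each row records one merge and its distance
    labels = fcluster(Z, t=2.0, criterion="distance")  # cut the dendrogram at height 2.0
    print(labels)  # each connected component below the cut forms a cluster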

Page 24

DIANA (Divisive Analysis)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

(Figure: three scatter plots on a 0-10 grid showing one cluster being progressively split.)

Page 25

Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
- Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
  - Medoid: a chosen, centrally located object in the cluster
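In code, the first four of these measures are simple reductions over the cross-cluster distance matrix; a sketch with illustrative toy clusters:

    import numpy as np
    from scipy.spatial.distance import cdist

    Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
    Kj = np.array([[4.0, 0.0], [5.0, 1.0]])

    D = cdist(Ki, Kj)           # |Ki| x |Kj| matrix of pairwise distances
    single = D.min()            # smallest element-to-element distance
    complete = D.max()          # largest element-to-element distance
    average = D.mean()          # average element-to-element distance
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))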

Page 26

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster
- Radius: maximum distance between all the points and the centroid
  - Alternatively (text): square root of the average squared distance from any point of the cluster to its centroid
- Diameter: maximum distance between any two points of the cluster
  - Alternatively (text): square root of the average squared distance between all pairs of points in the cluster
- Note: the radius and diameter of a cluster are not related as directly as they are in a circle, but they do have a tendency to be proportional
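Written out (a reconstruction consistent with the textbook definitions quoted above, where t_1, ..., t_N are the points of the cluster and C_m is its centroid):

    C_m = \frac{\sum_{i=1}^{N} t_i}{N}, \qquad
    R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - C_m)^2}{N}}, \qquad
    D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}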

Page 27

Extensions to Hierarchical Clustering

- Major weaknesses of agglomerative clustering methods
  - Can never undo what was done previously
  - Do not scale well: time complexity of at least O(n^2), where n is the number of total objects
- Integration of hierarchical & distance-based clustering
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling

Page 28

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

- Zhang, Ramakrishnan & Livny, SIGMOD'96
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records

Page 29

Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

- N: number of data points
- LS: linear sum of the N points: \sum_{i=1}^{N} X_i
- SS: square sum of the N points: \sum_{i=1}^{N} X_i^2

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
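A quick check of this example; the helper names are illustrative, and SS is kept per dimension to match the slide's (54, 190):

    import numpy as np

    def clustering_feature(points):
        """CF = (N, LS, SS) for a set of points."""
        points = np.asarray(points, dtype=float)
        return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

    def merge(cf1, cf2):
        """CFs are additive, which lets BIRCH combine subclusters cheaply."""
        return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

    print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
    # (5, array([16., 30.]), array([ 54., 190.]))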

Page 30

CF-Tree in BIRCH

- Clustering feature:
  - Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in the tree has descendants or "children"
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: max # of children
  - Threshold: max diameter of sub-clusters stored at the leaf nodes

Page 31

The CF Tree Structure

(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1-CF6, each pointing to a child; non-leaf nodes hold entries CF1-CF5 with children; leaf nodes hold up to six CF entries and are chained with prev/next pointers.)

Page 32

The BIRCH Algorithm

- Cluster diameter:

    D = \sqrt{\frac{1}{n(n-1)} \sum (x_i - x_j)^2}

- For each point in the input
  - Find the closest leaf entry
  - Add the point to the leaf entry and update the CF
  - If the entry diameter > max_diameter, then split the leaf, and possibly the parents
- The algorithm is O(n)
- Concerns
  - Sensitive to the insertion order of data points
  - Since the size of leaf nodes is fixed, the clusters may not be natural
  - Clusters tend to be spherical given the radius and diameter measures

Page 33

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

- CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
- Measures the similarity based on a dynamic model
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between them are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
- Graph-based, two-phase algorithm
  1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Page 34

KNN Graphs & Interconnectivity

- k-nearest-neighbor graphs are built from the original data (here, in 2-D)
- EC{Ci,Cj}: the absolute interconnectivity between Ci and Cj: the sum of the weights of the edges that connect vertices in Ci to vertices in Cj
- Internal interconnectivity of a cluster Ci: the size of its min-cut bisector ECCi (i.e., the weighted sum of the edges that partition the graph into two roughly equal parts)
- Relative interconnectivity (RI): see the formula below
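The RI formula itself did not survive the transcript; as defined in the CHAMELEON paper, it normalizes the absolute interconnectivity by the average internal interconnectivity of the two clusters:

    RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\tfrac{1}{2}\,(|EC_{C_i}| + |EC_{C_j}|)}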

Page 35

Relative Closeness & Merge of Sub-Clusters

- Relative closeness (RC) between a pair of clusters Ci and Cj: the absolute closeness between Ci and Cj normalized w.r.t. the internal closeness of the two clusters Ci and Cj (see the formula below)
  - S̄_EC(Ci) and S̄_EC(Cj) are the average weights of the edges that belong to the min-cut bisector of clusters Ci and Cj, respectively, and S̄_EC{Ci,Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj
- Merging sub-clusters:
  - Merge only those pairs of clusters whose RI and RC are both above some user-specified thresholds
  - Merge those maximizing the function that combines RI and RC
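The RC formula, reconstructed from the CHAMELEON paper to match the description above:

    RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}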

Page 36

Overall Framework of CHAMELEON

Data Set -> Construct a sparse K-NN graph -> Partition the graph -> Merge partitions -> Final clusters

- K-NN graph: p and q are connected if q is among the top-k closest neighbors of p
- Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
- Relative closeness: closeness of c1 and c2 over internal closeness

Page 37

CHAMELEON (Clustering Complex Objects)

(Figure: CHAMELEON clustering results on data sets with complex cluster shapes.)

Page 38

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 39

Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & D. Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

Page 40

Density-Based Clustering: Basic Concepts

- Two parameters:
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - p belongs to N_Eps(q), and
  - q satisfies the core point condition: |N_Eps(q)| >= MinPts

(Figure: p lies in the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm.)

Page 41

Density-Reachable and Density-Connected

- Density-reachable:
  - A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi
- Density-connected:
  - A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

(Figure: a chain q, p2, p illustrating density-reachability, and points p, q both density-reachable from o illustrating density-connectivity.)

Page 42

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)

Page 43

DBSCAN: The Algorithm

- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
- If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects; otherwise, the complexity is O(n^2)
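A minimal usage sketch with scikit-learn's DBSCAN implementation; the toy data and the eps/min_samples values are illustrative:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus one isolated point that should come out as noise
    X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
                  [9.0, 0.0]])

    db = DBSCAN(eps=0.5, min_samples=3).fit(X)  # eps/min_samples play the roles of Eps/MinPts
    print(db.labels_)  # cluster ids; noise is labeled -1, e.g., [0 0 0 1 1 1 -1]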

Page 44

DBSCAN: Sensitive to Parameters

DBSCAN online demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html

Page 45

OPTICS: A Cluster-Ordering Method (1999)

- OPTICS: Ordering Points To Identify the Clustering Structure
  - Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  - Produces a special order of the database w.r.t. its density-based clustering structure
  - This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
  - Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
  - Can be represented graphically or using visualization techniques

Page 46

OPTICS: Some Extensions from DBSCAN

- Index-based: k = # of dimensions, N = # of points
  - Complexity: O(N log N)
- Core distance of an object p: the smallest value ε' such that the ε'-neighborhood of p has at least MinPts objects. Let N_ε(p) be the ε-neighborhood of p, where ε is a distance value:
  - core-distance_{ε,MinPts}(p) = Undefined, if card(N_ε(p)) < MinPts; MinPts-distance(p), otherwise
- Reachability distance of object p from core object q: the minimum radius value that makes p density-reachable from q:
  - reachability-distance_{ε,MinPts}(p, q) = Undefined, if q is not a core object; max(core-distance(q), distance(q, p)), otherwise

Page 47

Core Distance & Reachability Distance

(Figure: illustration of the core distance of an object and the reachability distances of objects from it.)

Page 48

(Figure: the OPTICS reachability plot; y-axis: reachability-distance, including undefined values and thresholds ε and ε'; x-axis: the cluster order of the objects.)

Page 49

Density-Based Clustering: OPTICS & Applications

Demo: http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo

Page 50

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 51

Grid-Based Clustering Method

- Uses a multi-resolution grid data structure
- Several interesting methods
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
    - Both grid-based and subspace clustering
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
    - A multi-resolution clustering approach using wavelets

Page 52

STING: A Statistical Information Grid Approach

- Wang, Yang and Muntz (VLDB'97)
- The spatial area is divided into rectangular cells
- There are several levels of cells corresponding to different levels of resolution

Page 53

The STING Clustering Method

- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical info of each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can easily be calculated from parameters of lower-level cells
  - count, mean, s, min, max
  - type of distribution (normal, uniform, etc.)
- Use a top-down approach to answer spatial data queries
  - Start from a pre-selected layer, typically one with a small number of cells
  - For each cell in the current level, compute the confidence interval

Page 54

STING Algorithm and Its Analysis

- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages:
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages:
  - All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

Page 55

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering

Page 56

Measuring Clustering Quality

- Three kinds of measures: external, internal, and relative
  - External: supervised, employing criteria not inherent to the dataset
    - Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a clustering quality measure
  - Internal: unsupervised, with criteria derived from the data itself
    - Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are, e.g., the silhouette coefficient (see the sketch below)
  - Relative: directly compare different clusterings, usually those obtained via different parameter settings of the same algorithm
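A minimal sketch of internal evaluation with the silhouette coefficient, assuming scikit-learn; the toy data is illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Needs no ground truth; values near +1 mean compact, well-separated clusters
    print(silhouette_score(X, labels))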

Page 57

Measuring Clustering Quality: External Methods

- Clustering quality measure: Q(C, T), for a clustering C given the ground truth T
- Q is good if it satisfies the following four essential criteria
  - Cluster homogeneity: the purer, the better
  - Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

Page 58

Chapter 11. Cluster Analysis: Advanced Methods

- Probability Model-Based Clustering
- Clustering High-Dimensional Data
- Clustering Graphs and Network Data
- Clustering with Constraints

Page 59

Fuzzy Set and Fuzzy Cluster

- The clustering methods discussed so far
  - Every data object is assigned to exactly one cluster
- Some applications may need fuzzy or soft cluster assignment
  - Ex.: an e-game could belong to both entertainment and software
- Methods: fuzzy clusters and probabilistic model-based clusters
- Fuzzy cluster: a fuzzy set S: F_S : X -> [0, 1] (value between 0 and 1)
- Example: the popularity of cameras is defined as a fuzzy mapping; then, A(0.05), B(1), C(0.86), D(0.27)

Page 60

Fuzzy (Soft) Clustering

- Example: let the cluster features be
  - C1: "digital camera" and "lens"
  - C2: "computer"
- Fuzzy clustering
  - k fuzzy clusters C1, ..., Ck, represented as a partition matrix M = [wij]
  - P1: for each object oi and cluster Cj, 0 <= wij <= 1 (fuzzy set)
  - P2: for each object oi, the weights sum to 1: equal participation in the clustering
  - P3: for each cluster Cj, the weights sum to a value strictly between 0 and n: ensures there is no empty cluster
- Let c1, ..., ck be the centers of the k clusters
- For an object oi, a sum of squared error (SSE) is defined with a parameter p; an SSE is likewise defined for each cluster Cj and for the whole clustering, measuring how well the clustering fits the data (see the formulas below)
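The constraint and SSE formulas did not survive the transcript; reconstructed to be consistent with the descriptions above:

    \text{P2: } \sum_{j=1}^{k} w_{ij} = 1 \text{ for each } o_i, \qquad
    \text{P3: } 0 < \sum_{i=1}^{n} w_{ij} < n \text{ for each } C_j

    SSE(o_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2, \quad
    SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(o_i, c_j)^2, \quad
    SSE(C) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2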

Page 61

Probabilistic Model-Based Clustering

- Cluster analysis is about finding hidden categories
- A hidden category (i.e., a probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function)
- Ex.: two categories for digital cameras sold
  - consumer line vs. professional line
  - density functions f1, f2 for C1, C2, obtained by probabilistic clustering
- A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
- Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process

Page 62

Model-Based Clustering

- A set C of k probabilistic clusters C1, ..., Ck with probability density functions f1, ..., fk, respectively, and their probabilities ω1, ..., ωk
- The probability of an object o being generated by cluster Cj, the probability of o being generated by the set of clusters C, and (since objects are assumed to be generated independently) the probability of a data set D = {o1, ..., on} are given below
- Task: find a set C of k probabilistic clusters s.t. P(D|C) is maximized
  - However, maximizing P(D|C) is often intractable, since the probability density function of a cluster can take an arbitrarily complicated form
  - To make it computationally feasible (as a compromise), assume that the probability density functions are some parameterized distributions
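The three probabilities, reconstructed to match the definitions above:

    P(o \mid C_j) = \omega_j f_j(o), \qquad
    P(o \mid \mathcal{C}) = \sum_{j=1}^{k} \omega_j f_j(o), \qquad
    P(D \mid \mathcal{C}) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j f_j(o_i)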

Page 63

The EM (Expectation-Maximization) Algorithm

- EM: a popular framework for approaching maximum likelihood or maximum a posteriori estimates of parameters in statistical models
- EM clustering: an extension of k-means. It starts with an initial estimate of the parameter vector, then iterates:
  - The E-step (expectation step) assigns objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters
  - The M-step (maximization step) finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood
- k-means also has two steps at each iteration:
  - E-step: given the current cluster centers, each object is assigned to the cluster whose center is closest to it: an object is expected to belong to the closest cluster
  - M-step: given the cluster assignment, for each cluster the algorithm adjusts the center so that the sum of the distances from the objects assigned to the cluster to the new center is minimized
- The algorithm converges fast but may not reach the global optimum

Page 64

EM Clustering

- Initially, randomly assign k cluster centers
- Iteratively refine the clusters in two steps
  - Expectation step: assign each data point Xi to cluster Ck with the probability given by the current model
  - Maximization step: estimate the model parameters from the current assignments

Page 65

EM Clustering Example

- Initially, let c1 = a and c2 = b
- 1st E-step: assign each object o to c1 and c2 with a weight (the weight formula is omitted in the transcript)
- 1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)
- Iterate until the cluster centers converge or the change is small enough

Page 66

Univariate Gaussian Mixture Model

- O = {o1, ..., on} (n observed objects), Θ = {θ1, ..., θk} (parameters of the k distributions), and Pj(oi | θj) is the probability that oi is generated from the j-th distribution using parameter θj
- Univariate Gaussian mixture model
  - Assume the probability density function of each cluster follows a 1-D Gaussian distribution, and suppose that there are k clusters
  - The probability density function of each cluster is centered at µj with standard deviation σj; with θj = (µj, σj), we have the mixture density below
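The mixture density, reconstructed to match the setup above (each cluster a 1-D Gaussian with parameters \theta_j = (\mu_j, \sigma_j)):

    P(o_i \mid \Theta) = \sum_{j=1}^{k} P(o_i \mid \theta_j)
                       = \sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}}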

Page 67

Computing Mixture Models with EM

- Given n objects O = {o1, ..., on}, we want to mine a set of parameters Θ = {θ1, ..., θk} s.t. P(O|Θ) is maximized, where θj = (µj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution
- We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until convergence or until the change is sufficiently small
- At the E-step, for each object oi, calculate the probability that oi belongs to each distribution
- At the M-step, adjust the parameters θj = (µj, σj) so that the expected likelihood P(O|Θ) is maximized
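The update equations did not survive the transcript; a reconstruction consistent with the E- and M-step descriptions above:

    \text{E-step: } P(\theta_j \mid o_i, \Theta) = \frac{P(o_i \mid \theta_j)}{\sum_{l=1}^{k} P(o_i \mid \theta_l)}

    \text{M-step: } \mu_j = \frac{\sum_{i=1}^{n} o_i\, P(\theta_j \mid o_i, \Theta)}{\sum_{i=1}^{n} P(\theta_j \mid o_i, \Theta)}, \qquad
    \sigma_j = \sqrt{\frac{\sum_{i=1}^{n} P(\theta_j \mid o_i, \Theta)\,(o_i - \mu_j)^2}{\sum_{i=1}^{n} P(\theta_j \mid o_i, \Theta)}}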

Page 68

Strengths and Weaknesses of Mixture Models

- Strengths
  - Mixture models are more general than partitioning and fuzzy clustering
  - Clusters can be characterized by a small number of parameters
  - The results may satisfy the statistical assumptions of the generative models
- Weaknesses
  - Converge to a local optimum (overcome this by running multiple times with random initialization)
  - Computationally expensive if the number of distributions is large or the data set contains very few observed data points
  - Need large data sets
  - Hard to estimate the number of clusters

Page 69

Chapter 11. Cluster Analysis: Advanced Methods

- Probability Model-Based Clustering
- Clustering High-Dimensional Data
- Clustering Graphs and Network Data
- Clustering with Constraints

Page 70

Clustering High-Dimensional Data

- Clustering high-dimensional data (how high is high-D in clustering?)
  - Many applications: text documents, DNA microarray data
  - Major challenges:
    - Many irrelevant dimensions may mask clusters
    - Distance measures become meaningless, due to equi-distance
    - Clusters may exist only in some subspaces
- Methods
  - Subspace clustering: search for clusters existing in subspaces of the given high-dimensional data space
    - CLIQUE, ProClus, and bi-clustering approaches
  - Dimensionality-reduction approaches: construct a much lower-dimensional space and search for clusters there (new dimensions may be constructed by combining some dimensions of the original data)
    - Dimensionality reduction methods and spectral clustering

Page 71

Traditional Distance Measures May Not Be Effective on High-D Data

- Traditional distance measures can be dominated by noise in many dimensions
- Ex.: which pairs of customers are more similar? By Euclidean distance we may get a counterintuitive answer, even though Ada and Cathy look more similar
- Clustering should not only consider dimensions but also attributes (features)
  - Feature transformation: effective if most dimensions are relevant (PCA & SVD are useful when features are highly correlated/redundant)
  - Feature selection: useful for finding a subspace where the data have nice clusters

Page 72

The Curse of Dimensionality (graphs adapted from Parsons et al., SIGKDD Explorations 2004)

- Data in only one dimension is relatively packed
- Adding a dimension "stretches" the points across that dimension, pushing them further apart
- Adding more dimensions makes the points further apart still: high-dimensional data is extremely sparse
- Distance measures become meaningless, due to equi-distance

Page 73

Why Subspace Clustering? (adapted from Parsons et al., SIGKDD Explorations 2004)

- Clusters may exist only in some subspaces
- Subspace clustering: find clusters in all the subspaces

Page 74

Subspace Clustering Methods

- Subspace search methods: search various subspaces to find clusters
  - Bottom-up approaches
  - Top-down approaches
- Correlation-based clustering methods
  - E.g., PCA-based approaches
- Bi-clustering methods
  - Optimization-based methods
  - Enumeration methods

Page 75

Subspace Clustering Method (I): Subspace Search Methods

- Search various subspaces to find clusters
- Bottom-up approaches
  - Start from low-D subspaces and search higher-D subspaces only when there may be clusters in those subspaces
  - Various pruning techniques reduce the number of higher-D subspaces to be searched
  - Ex.: CLIQUE (Agrawal et al., 1998)
- Top-down approaches
  - Start from the full space and search smaller subspaces recursively
  - Effective only if the locality assumption holds: it restricts that the subspace of a cluster can be determined by the local neighborhood
  - Ex.: PROCLUS (Aggarwal et al., 1999): a k-medoid-like method

Page 76

CLIQUE (Subspace Clustering with Apriori Pruning)

- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  - A cluster is a maximal set of connected dense units within a subspace

Page 77

CLIQUE: The Major Steps

- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters
  - Determine the maximal regions that cover a cluster of connected dense units for each cluster
  - Determine the minimal cover for each cluster

Page 78

(Figure: CLIQUE example with density threshold τ = 3. Dense units are found in the Salary(10,000) vs. age subspace and in the Vacation(week) vs. age subspace, with age roughly 30-50; their intersection identifies a candidate dense region in the 3-D Salary-Vacation-age space.)

Page 79

Strength and Weakness of CLIQUE

- Strengths
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in the input; does not presume any canonical data distribution
  - Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Page 80

Bi-Clustering Methods

- Real-world data is noisy: try to find approximate bi-clusters
- Methods: optimization-based methods vs. enumeration methods
- Optimization-based methods
  - Try to find one submatrix at a time that achieves the best significance as a bi-cluster
  - Due to the computational cost, greedy search is employed to find local optimal bi-clusters
  - Ex.: δ-Cluster algorithm (Cheng and Church, ISMB'2000)
- Enumeration methods
  - Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined
  - Then try to enumerate all submatrices satisfying the requirements as bi-clusters
  - Ex.: δ-pCluster algorithm (H. Wang et al., SIGMOD'2002); MaPle (Pei et al., ICDM'2003)

Page 81

Bi-Clustering for Micro-Array Data Analysis

- Left figure: micro-array "raw" data shows 3 genes and their values in a multi-dimensional space: difficult to find their patterns
- Right two figures: some subsets of dimensions form nice shift and scaling patterns
- No globally defined similarity/distance measure
- Clusters may not be exclusive
  - An object can appear in multiple clusters

Page 82

Chapter 11. Cluster Analysis: Advanced Methods

- Probability Model-Based Clustering
- Clustering High-Dimensional Data
- Clustering Graphs and Network Data
- Clustering with Constraints

Page 83

Clustering Graphs and Network Data

- Applications
  - Bipartite graphs, e.g., customers and products, authors and conferences
  - Web search engines, e.g., click-through graphs and Web graphs
  - Social networks, friendship/coauthor graphs
- Similarity measures
  - Geodesic distances
  - Distance based on random walk (SimRank)
- Graph clustering methods
  - Minimum cuts: FastModularity (Clauset, Newman & Moore, 2004)
  - Density-based clustering: SCAN (Xu et al., KDD'2007)

Page 84

Similarity Measure (I): Geodesic Distance

- Geodesic distance (A, B): the length (i.e., # of edges) of the shortest path between A and B (defined as infinite if they are not connected)
- Eccentricity of v, eccen(v): the largest geodesic distance between v and any other vertex u ∈ V − {v}
  - E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3
- Radius of graph G: the minimum eccentricity over all vertices, i.e., the distance between the "most central point" and the "farthest border"
  - r = min_{v∈V} eccen(v)
  - E.g., radius(g) = 2
- Diameter of graph G: the maximum eccentricity over all vertices, i.e., the largest distance between any pair of vertices in G
  - d = max_{v∈V} eccen(v)
  - E.g., diameter(g) = 3
- A peripheral vertex is a vertex that achieves the diameter
  - E.g., vertices c, d, and e are peripheral vertices

Page 85: Chapter 10. Cluster Analysis: Basic Concepts and …Considerations for Cluster Analysis ! Partitioning criteria ! Single level vs. hierarchical partitioning (often, multi-level hierarchical

SimRank: Similarity Based on Random Walk and Structural Context

n  SimRank: structural-context similarity, i.e., two vertices are similar if their neighbors are similar
n  In a directed graph G = (V, E):
    n  In-neighborhood of v: I(v) = {u | (u, v) ∈ E}
    n  Out-neighborhood of v: O(v) = {w | (v, w) ∈ E}
n  Similarity in SimRank (C is a decay constant with 0 < C < 1):
    n  s(u, u) = 1; for u ≠ v, s(u, v) = (C / (|I(u)|·|I(v)|)) · Σ_{x∈I(u)} Σ_{y∈I(v)} s(x, y), and s(u, v) = 0 if I(u) or I(v) is empty
n  Initialization: s_0(u, v) = 1 if u = v, and 0 otherwise
    n  Then we can compute s_{i+1} from s_i by substituting s_i into the right-hand side of the definition (a naive iteration sketch follows below)
n  Similarity based on random walk, defined within a strongly connected component:
    n  Expected distance: d(u, v) = Σ_{t: u⇝v} P[t]·l(t), summed over tours t from u to v that touch v only once
    n  Expected meeting distance: m(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t]·l(t), summed over tours along which two walkers starting at u and v first meet
    n  Expected meeting probability: p(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t]·C^{l(t)}
    n  P[t] is the probability of the tour t = w_1 → w_2 → … → w_k: P[t] = Π_{i=1}^{k−1} 1/|O(w_i)|; l(t) is the length of t
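A naive sketch of the SimRank iteration described above (quadratic in the number of vertex pairs per round; production systems use pruning or random-walk approximation). The graph, decay constant, and iteration count are illustrative.

```python
import numpy as np

def simrank(edges, n, C=0.8, iters=10):
    """Naive SimRank iteration on a directed graph with vertices 0..n-1."""
    in_nbrs = [[] for _ in range(n)]
    for u, v in edges:
        in_nbrs[v].append(u)
    s = np.eye(n)                    # s_0: 1 on the diagonal, 0 elsewhere
    for _ in range(iters):
        s_next = np.eye(n)
        for u in range(n):
            for v in range(n):
                if u == v or not in_nbrs[u] or not in_nbrs[v]:
                    continue         # s stays 1 (u == v) or 0 (no in-neighbors)
                total = sum(s[x, y] for x in in_nbrs[u] for y in in_nbrs[v])
                s_next[u, v] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        s = s_next
    return s

# Tiny example: vertices 1 and 2 are both pointed to by vertex 0,
# so they get similarity C = 0.8.
print(simrank([(0, 1), (0, 2)], n=3).round(3))
```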


Graph Clustering: Sparsest Cut

n  Let G = (V, E). A cut C = (S, T) partitions V into two non-empty sets S and T; its cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}
n  Size of the cut: # of edges in the cut set
    n  The minimum cut (e.g., C1 in the original figure) is often not a good partition: it may simply split off a tiny piece of the graph
n  A better measure, sparsity: Φ(C) = (size of the cut set) / min(|S|, |T|)
    n  A cut is sparsest if its sparsity is not greater than that of any other cut
    n  Ex. in the figure's example graph, cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut
n  For k clusters, the modularity of a clustering assesses its quality:
    n  Q = Σ_{i=1}^{k} ( l_i/|E| − (d_i/(2|E|))² ), where l_i is the # of edges between vertices in the i-th cluster and d_i is the sum of the degrees of the vertices in the i-th cluster
n  The modularity of a clustering of a graph is the difference between the fraction of all edges that fall within individual clusters and the fraction expected if the graph's vertices were randomly connected
n  The optimal clustering of a graph maximizes the modularity (a sketch follows below)
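A minimal sketch of the modularity formula above, assuming an undirected graph given as an edge list and a vertex-to-cluster map; names are ours.

```python
def modularity(edges, clusters):
    """Q = sum_i ( l_i/|E| - (d_i / (2|E|))^2 ) over clusters, where
    l_i = # edges inside cluster i, d_i = sum of degrees in cluster i."""
    m = len(edges)
    l = {}   # intra-cluster edge counts
    d = {}   # degree sums per cluster
    for u, v in edges:
        d[clusters[u]] = d.get(clusters[u], 0) + 1
        d[clusters[v]] = d.get(clusters[v], 0) + 1
        if clusters[u] == clusters[v]:
            l[clusters[u]] = l.get(clusters[u], 0) + 1
    return sum(l.get(c, 0) / m - (d[c] / (2 * m)) ** 2 for c in d)

# Two triangles joined by one bridge edge, clustered triangle-by-triangle:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(modularity(edges, clusters))   # ~0.357: a fairly good clustering
```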


Graph Clustering: Challenges of Finding Good Cuts

n  High computational cost
    n  Many graph cut problems are computationally expensive; the sparsest cut problem, for example, is NP-hard
    n  Need to trade off efficiency/scalability against quality
n  Sophisticated graphs
    n  May involve weights and/or cycles
n  High dimensionality
    n  A graph can have many vertices; in a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
n  Sparsity
    n  A large graph is often sparse: each vertex connects, on average, to only a small number of other vertices
    n  A similarity matrix derived from a large sparse graph can also be sparse


Two Approaches for Graph Clustering

n  Two approaches for clustering graph data
    n  Use generic clustering methods for high-dimensional data
    n  Use methods designed specifically for clustering graphs
n  Using generic clustering methods for high-dimensional data
    n  Extract a similarity matrix from the graph using a similarity measure
    n  A generic clustering method can then be applied to the similarity matrix to discover clusters
    n  Ex. spectral clustering: approximates optimal graph cut solutions (see the sketch below)
n  Methods specific to graphs
    n  Search the graph to find well-connected components as clusters
    n  Ex. SCAN (Structural Clustering Algorithm for Networks)
        n  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", KDD'07
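A compact sketch of the spectral clustering idea, assuming numpy and scikit-learn are available: embed the vertices using the bottom eigenvectors of the normalized Laplacian of the similarity matrix, then run k-means on the embedding. This is one standard formulation, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Minimal spectral clustering on a (dense) similarity matrix W."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)              # ascending eigenvalues
    U = eigvecs[:, :k]                                # spectral embedding
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Similarity matrix of two triangles joined by one weak edge:
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_clustering(W, k=2))   # e.g., [0 0 0 1 1 1]
```

For a large sparse graph one would use sparse matrices and an iterative eigensolver instead of the dense routines shown here.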


Chapter 11. Cluster Analysis: Advanced Methods

n  Probability Model-Based Clustering

n  Clustering High-Dimensional Data

n  Clustering Graphs and Network Data

n  Clustering with Constraints


Why Constraint-Based Cluster Analysis?

n  Need user feedback: users know their applications best
n  Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters


Categorization of Constraints

n  Constraints on instances: specify how a pair or a set of instances should be grouped in the cluster analysis
    n  Must-link vs. cannot-link constraints
        n  must-link(x, y): x and y should be grouped into one cluster
        n  cannot-link(x, y): x and y should not be grouped into one cluster
    n  Constraints can be defined using variables, e.g., cannot-link(x, y) if dist(x, y) > d
n  Constraints on clusters: specify a requirement on the clusters
    n  E.g., the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., convex), the # of clusters (e.g., k)
n  Constraints on similarity measurements: specify a requirement that the similarity calculation must respect
    n  E.g., driving distance on roads, obstacles (e.g., rivers, lakes)
n  Issues: hard vs. soft constraints; conflicting or redundant constraints


Constraint-Based Clustering Methods (I): Handling Hard Constraints

n  Handling hard constraints: strictly respect the constraints in cluster assignments
n  Example: the COP-k-means algorithm (a sketch follows below)
    n  Generate super-instances for must-link constraints
        n  Compute the transitive closure of the must-link constraints
        n  Replace all the objects in each resulting subset by their mean
        n  Each super-instance also carries a weight: the number of objects it represents
    n  Conduct modified k-means clustering to respect the cannot-link constraints
        n  Modify the center-assignment step of k-means into a nearest feasible center assignment
        n  An object is assigned to the nearest center such that the assignment respects all cannot-link constraints
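A simplified sketch of the COP-k-means procedure described above (ours, not the reference implementation): must-link groups become weighted super-instances, and each super-instance is assigned to the nearest center that violates no cannot-link constraint. It assumes k does not exceed the number of super-instances.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, iters=20, seed=0):
    """Simplified COP-k-means sketch with hard constraints."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # 1. Transitive closure of must-link constraints via union-find.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in must_link:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    members = list(groups.values())
    # 2. Super-instances: each group is replaced by its mean and carries
    #    a weight equal to the number of objects it represents.
    S = np.array([X[g].mean(axis=0) for g in members])
    w = np.array([len(g) for g in members], dtype=float)
    gid = {i: gi for gi, g in enumerate(members) for i in g}
    cl = {(gid[a], gid[b]) for a, b in cannot_link}
    cl |= {(b, a) for a, b in cl}
    # 3. Modified k-means with nearest *feasible* center assignment.
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=k, replace=False)]
    assign = -np.ones(len(S), dtype=int)
    for _ in range(iters):
        assign[:] = -1
        for i in range(len(S)):
            order = np.argsort(((centers - S[i]) ** 2).sum(axis=1))
            for c in order:  # skip centers that would break a cannot-link
                if all(assign[j] != c for j in range(len(S)) if (i, j) in cl):
                    assign[i] = c
                    break
            else:
                raise ValueError("constraints admit no feasible assignment")
        for c in range(k):   # weighted centroid update
            if (assign == c).any():
                centers[c] = np.average(S[assign == c], axis=0,
                                        weights=w[assign == c])
    return assign, centers
```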


Constraint-Based Clustering Methods (II): Handling Soft Constraints

n  Treated as an optimization problem: when a clustering violates a soft constraint, a penalty is imposed on the clustering
n  Overall objective: optimize the clustering quality while minimizing the constraint-violation penalty
n  Ex. the CVQE (Constrained Vector Quantization Error) algorithm: conduct k-means clustering while enforcing constraint-violation penalties
    n  Objective function: the sum of distances used in k-means, adjusted by the constraint-violation penalties (a sketch follows below)
    n  Penalty for a must-link violation
        n  If objects x and y must be linked but are assigned to two different centers c1 and c2, dist(c1, c2) is added to the objective function as the penalty
    n  Penalty for a cannot-link violation
        n  If objects x and y cannot be linked but are assigned to a common center c, dist(c, c′) is added to the objective function as the penalty, where c′ is the closest center to c that can accommodate x or y
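A sketch of a CVQE-style objective under the penalties described above; the actual CVQE algorithm also prescribes how assignments and centers are updated, which is omitted here. Squared Euclidean distance is assumed, and all names are ours.

```python
import numpy as np

def cvqe_objective(X, centers, assign, must_link, cannot_link):
    """k-means distortion plus soft-constraint violation penalties."""
    X, centers = np.asarray(X, float), np.asarray(centers, float)
    # Base term: squared distance of each object to its assigned center.
    obj = sum(np.sum((X[i] - centers[assign[i]]) ** 2) for i in range(len(X)))
    # Must-link violation: x, y in different clusters -> add dist(c1, c2).
    for x, y in must_link:
        if assign[x] != assign[y]:
            obj += np.sum((centers[assign[x]] - centers[assign[y]]) ** 2)
    # Cannot-link violation: x, y share a center c -> add dist(c, c'),
    # with c' taken here as the closest other center.
    for x, y in cannot_link:
        if assign[x] == assign[y]:
            c = assign[x]
            d = np.sum((centers - centers[c]) ** 2, axis=1)
            d[c] = np.inf
            obj += d.min()
    return obj
```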


An Example: Clustering With Obstacle Objects

[Figure: two clusterings of the same data points, one taking obstacles into account and one not.]


What Are Outliers? (ch11)

n  Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
    n  Ex. an unusual credit card purchase; Michael Jordan among sports figures, ...
n  Outliers are different from noise data
    n  Noise is random error or variance in a measured variable
    n  Noise should be removed before outlier detection
n  Outliers are interesting: they violate the mechanism that generates the normal data
n  Outlier detection vs. novelty detection: an object flagged as an outlier at an early stage may later be merged into the model as a new, legitimate pattern
n  Applications
    n  Credit card fraud detection
    n  Telecom fraud detection
    n  Customer segmentation
    n  Medical analysis
