a min-max cut algorithm for graph partitioning and data clustering

A Min-max Cut Algorithm for Graph Partitioning and Data Clustering Authors : Chris H.Q. Ding a , Xiaofeng He a,b , Hongyuan Zha b , Ming Gu c , Horst D. Simon a a NERSC Division, Lawrence Berkeley National Laboratory, University of California b Department of Computer Science and Engineering, Pennsylvania State University c Department of Mathematics, University of California From : 1-th IEEE International Conference on Data Mining (ICDM), 2001 Google scholar : 419

Page 1: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Authors : Chris H.Q. Ding (a), Xiaofeng He (a,b), Hongyuan Zha (b), Ming Gu (c), Horst D. Simon (a)

(a) NERSC Division, Lawrence Berkeley National Laboratory, University of California
(b) Department of Computer Science and Engineering, Pennsylvania State University
(c) Department of Mathematics, University of California
From : 1st IEEE International Conference on Data Mining (ICDM), 2001
Google scholar : 419

Page 2: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Outline

• 1. Introduction (5)
• 2. Min-max cut (11)
• 3. Related work on spectral graph partition (5)
• 4. Random graph model (2)
• 5. Mcut algorithm (2)
• 6. Experiments (5)
• 7. Skewed cut (9)
• 8. Linkage-based refinements (6)
• 9. Linkage differential order (2)
• 10. Conclusion

Page 3: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Introduction (1/5)

• The graphs generated from the World Wide Web are highly irregular or random.

• Partitioning the web graph is useful to automatically identify topics from the retrieved webpages for a user query.

Page 4: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Introduction (2/5)

• In this paper, the authors emphasize graph partition as data clustering using a graph model.

• Given the attributes of the data points in a dataset and a similarity or affinity metric between any two points, the pairwise similarities form a symmetric weighted adjacency matrix.
  – Thus, the data clustering problem becomes a graph partition problem.

Page 5: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Introduction (3/5)

• The authors proposed a new graph partition method based on the min-max clustering principle :
  – The similarity or association between two subgraphs is minimized, while the similarity or association within each subgraph (summation of similarity between all pairs of nodes within a subgraph) is maximized.

Page 6: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Introduction (4/5)

• Finding the optimal solution to the graph partition problem is NP-complete.

• The relaxed version of the min-max cut function optimization leads to a generalized eigenvalue problem.

Page 7: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Introduction (5/5)

• The eigenvector with the second-lowest eigenvalue, called the Fiedler vector, provides a linear search order (the Fiedler order).
  – It is generally believed that the Fiedler order provides the best known linearized order to search for the optimal cut.

• Here, the authors find a linkage differential order that provides a better ordination than the Fiedler vector.

Page 8: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (1/11)

• Given a weighted graph G = G(E, V) with weight matrix W, we wish to partition it into two subgraphs A and B using the min-max clustering principle.
  – The similarity or association between two nodes u and v is their edge weight W_uv.
  – The similarity between subgraphs A and B is the cutsize cut(A, B), the sum of W_uv over all u in A and v in B.

Page 9: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (2/11)

The min-max clustering principle requires that we minimize cut(A, B) while maximizing W(A) and W(B), the sums of edge weights within A and within B. All these requirements can be satisfied simultaneously by the following objective function :

Mcut(A, B) = cut(A, B)/W(A) + cut(A, B)/W(B)

The authors call it the min-max cut function, or Mcut for short.
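As a rough illustration of the objective, the sketch below evaluates Mcut for a given two-way partition. It assumes the form Mcut = cut(A, B)/W(A) + cut(A, B)/W(B), with W(A) summed over all ordered node pairs inside A; the function name `mcut`, the boolean mask `in_A`, and the toy graph are made up for this example and are not taken from the slides.

```python
import numpy as np

def mcut(W, in_A):
    """Mcut value of the two-way partition (A, B), with B the complement of A."""
    W = np.asarray(W, dtype=float)
    in_A = np.asarray(in_A, dtype=bool)
    in_B = ~in_A
    cut = W[np.ix_(in_A, in_B)].sum()   # cut(A, B): total weight crossing the partition
    w_a = W[np.ix_(in_A, in_A)].sum()   # W(A): ordered pairs, so each edge counts twice
    w_b = W[np.ix_(in_B, in_B)].sum()   # W(B)
    return cut / w_a + cut / w_b        # undefined if a side has no internal weight

# Toy graph (assumed for illustration): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

natural = mcut(W, [True, True, True, False, False, False])
lopsided = mcut(W, [True, True, False, False, False, False])
print(natural, lopsided)
```

Splitting the graph at the weak edge gives a much smaller value than cutting through a triangle, which is the behaviour the min-max principle asks for.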

Page 10: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (3/11)

• To reveal the solution, the authors reorder the rows and columns of W conformally with subgraphs A and B such that

Page 11: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (4/11)

• Let x and y be vectors conformally partitioned with A and B.

Mcut is invariant under changes of ||x||_2 and ||y||_2, and

Page 12: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (5/11)

Page 13: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (6/11)

Equation (7)

Page 14: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (7/11)

• It follows from Theorem 1 that both ratios of (7) are equal at the optimal solution :

Hence the resulting clusters tend to have similar weights and are thus balanced.

Page 15: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (8/11)

Fiedler linear search order

• The solution to the partition problem can be represented by an indicator vector q,
  – where the nodal value of q on node u is q_u ∈ {a, −b}.
• Theorem 1 implies that the solution vectors x and y must lie in the eigenspace of .
  – The first eigenvector, with the largest eigenvalue, does not match q.
  – The second eigenvector z_2 satisfies z_2^T z_1 = 0 and has both positive and negative elements.
• It is therefore a good approximation of q.

Page 16: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (9/11)

Fiedler linear search order

• More directly, it can be shown that :

Relaxing q_u to real values in [−1, 1], the solution that minimizes the Rayleigh quotient J_N(q) is given by :

subject to

Page 17: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (10/11)

Fiedler linear search order

• We obtain the lower bound for the Mcut objective :

Note that this bound is the same as the optimal Mcut value


Page 18: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Min-max Cut (11/11)

Fiedler linear search order

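• This section reaches three conclusions :
• 1. Under suitable conditions, the resulting clusters should be similar in size.
• 2. Under suitable conditions, the best approximate solution can be found.
• 3. Under suitable conditions, a lower bound for Mcut can be found.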

Page 19: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Related work on spectral graph partition (1/5)

• Spectral graph partitioning is based on the properties of eigenvectors of the Laplacian matrix L = D – W.
  – First developed by Donath and Hoffman [8] and Fiedler [11, 12], and more recently popularized by the work of Pothen, Simon and Liou [21].

[8] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 17: 420-425, 1973.
[11] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Math. J., 23: 298-305, 1973.
[12] M. Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Math. J., 25: 619-637, 1975.
[21] A. Pothen, H. D. Simon, and K. P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11: 430-452, 1990.

Page 20: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Related work on spectral graph partition (2/5)

• The objective of the partitioning is to minimize the cut size J(A, B) = cut(A, B) with the requirement that two subgraphs have the same number of nodes : |A| = |B|.

Page 21: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Related work on spectral graph partition (3/5)

• Hagen and Kahng [14] remove the requirement |A| = |B| and show that the Fiedler vector provides a good linear search order for the ratio cut (Rcut) partitioning criterion [4].

[4] C.-K. Cheng and Y.A. Wei. An improved two-way partitioning algorithm with stable performance. IEEE Trans. on Computer-Aided Design, 10: 1502-1511, 1991.
[14] L. Hagen and A.B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. on Computer-Aided Design, 11: 1074-1085, 1992.

Page 22: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Related work on spectral graph partition (4/5)

• Shi and Malik propose the normalized cut,

• They show that Ncut can be reduced to Ncut(A, B) = JN(q). (Rayleigh quotient)

Page 23: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Related work on spectral graph partition (5/5)

• Besides spectral partitioning methods, other recent partitioning methods seek to minimize the sum of subgraph diameters (see [7]) or solve the k-center problem (see [1]).

[1] P.K. Agarwal and C.M. Procopiuc. Exact and approximation algorithms for clustering. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, pages 658-667, 1998.
[7] S. Doddi, M.V. Marathe, S.S. Ravi, D.S. Taylor, and P. Widmayer. Approximation algorithms for clustering to minimize the sum of diameters. Nordic Journal of Computing, 7(3): 185, Fall 2000.

Page 24: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Random graph model (1/2)

• Suppose we have a uniformly distributed random graph with n nodes.
  – For this random graph, any two nodes are connected with probability p.
• The authors consider four objective functions :
  – MINcut
  – Rcut
  – Ncut
  – Mcut

Page 25: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Random graph model (2/2)

• The following theorem holds.
• Theorem 2.
  – For random graphs, MINcut favors highly skewed cuts, i.e., very uneven sizes.
  – Mcut favors balanced cuts, i.e., both subgraphs have the same size.
  – Rcut and Ncut show no size preference.
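The size preferences in Theorem 2 can be checked numerically with a small sketch that replaces every weight by its expected value. The forms used here are assumptions, since the slides lost the formulas: cut(A, B) = p|A||B|, W(A) = p|A|(|A| − 1) counted over ordered pairs, Rcut = cut/|A| + cut/|B|, Ncut = cut/(W(A) + cut) + cut/(W(B) + cut), and Mcut = cut/W(A) + cut/W(B). The function name `expected_objectives` is hypothetical.

```python
def expected_objectives(n, size_a, p):
    """Expected-value objectives for a split with |A| = size_a (needs 2 <= size_a <= n - 2)."""
    size_b = n - size_a
    cut = p * size_a * size_b            # expected weight between A and B
    w_a = p * size_a * (size_a - 1)      # expected weight inside A (ordered pairs)
    w_b = p * size_b * (size_b - 1)      # expected weight inside B
    return {
        "MINcut": cut,
        "Rcut": cut / size_a + cut / size_b,
        "Ncut": cut / (w_a + cut) + cut / (w_b + cut),
        "Mcut": cut / w_a + cut / w_b,
    }

skewed = expected_objectives(100, 10, 0.1)
balanced = expected_objectives(100, 50, 0.1)
for name in ("MINcut", "Rcut", "Ncut", "Mcut"):
    print(name, skewed[name], balanced[name])
```

Under these assumptions MINcut is smaller for the skewed split, Mcut is smaller for the balanced split, and Rcut and Ncut do not change with the split size.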

Page 26: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Mcut algorithm (1/2)

• Step 1 :
  – Compute the Fiedler vector from Eq. (10).
  – Sort the nodal values to obtain the Fiedler order.

Page 27: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Mcut algorithm (2/2)

• Step 2 :
  – Search along the Fiedler order for the optimal cut point, i.e., the one with the lowest Mcut.
• Step 3 :
  – Do linkage-based refinements.
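Steps 1 and 2 could look roughly like the following sketch. Eq. (10) is not reproduced in the transcript, so the code assumes it is the generalized eigenproblem (D − W)q = λDq and solves it through the equivalent symmetric normalized Laplacian; the names `fiedler_order` and `best_cut_along` and the toy graph are made up here, and Step 3 is not included.

```python
import numpy as np

def fiedler_order(W):
    """Node order from sorting the second-smallest generalized eigenvector of (D - W, D)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                       # node degrees, assumed strictly positive
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_isqrt[:, None] * W * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)         # eigenvalues come back in ascending order
    q = d_isqrt * vecs[:, 1]                # map back to the generalized eigenvector
    return np.argsort(q)

def best_cut_along(W, order):
    """Try every cut point along a linear order and keep the one with the lowest Mcut."""
    W = np.asarray(W, dtype=float)
    n = len(order)
    best_value, best_mask = np.inf, None
    for k in range(1, n):
        in_A = np.zeros(n, dtype=bool)
        in_A[order[:k]] = True
        cut = W[np.ix_(in_A, ~in_A)].sum()
        w_a = W[np.ix_(in_A, in_A)].sum()
        w_b = W[np.ix_(~in_A, ~in_A)].sum()
        if w_a == 0 or w_b == 0:            # Mcut is undefined for such splits
            continue
        value = cut / w_a + cut / w_b
        if value < best_value:
            best_value, best_mask = value, in_A
    return best_value, best_mask

# Toy graph (assumed for illustration): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

value, in_A = best_cut_along(W, fiedler_order(W))
print(value, in_A)
```

The linear search costs one objective evaluation per cut point here; a practical implementation would update cut(A, B), W(A) and W(B) incrementally as nodes cross the cut.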

Page 28: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Experiments (1/5)

• Here we perform experiments on newsgroup articles from 20 newsgroups. We focus on three datasets, each consisting of two newsgroups :

Page 29: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Experiments (2/5)

• The word-document matrix X is constructed first.
  – 2000 words are selected according to the mutual information between words and documents.
  – The tf-idf scheme is used for term weighting, and standard cosine similarity is used between two documents.
• When each document (a column of X) is normalized to 1 using the L2 norm, the document-document similarities are calculated as W = X^T X.
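The last step, turning a weighted word-document matrix into a similarity graph, can be sketched as follows. Word selection by mutual information and tf-idf weighting are not implemented; the input `X` is assumed to be already weighted, and the function name and toy matrix are invented for this example.

```python
import numpy as np

def cosine_similarity_graph(X):
    """W = X^T X after scaling every column (document) of X to unit L2 norm."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)   # one norm per document; assumed nonzero
    Xn = X / norms
    return Xn.T @ Xn

# Toy word-document matrix (assumed): 4 words (rows) by 3 documents (columns).
X = [[1, 0, 1],
     [1, 0, 0],
     [0, 2, 0],
     [0, 1, 1]]
W = cosine_similarity_graph(X)
print(np.round(W, 3))
```

Each entry W[i, j] is the cosine of the angle between documents i and j, so the diagonal is 1 and documents with no shared words get weight 0.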

Page 30: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Experiments (3/5)

• To obtain better statistics, the authors divide one newsgroup A randomly into K1 subgroups and the other newsgroup B randomly into K2 subgroups.
  – One of the K1 subgroups of A is mixed with one of the K2 subgroups of B to produce a dataset G.
• The graph partition methods are run on this dataset G to produce two clusters.
  – This is repeated for all K1K2 pairs between A and B, and the accuracy is averaged.

Page 31: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Experiments (4/5)

• Mcut performs about the same as Ncut for newsgroups NG1/NG2, where the cluster overlap is small.

• Mcut performs better than Ncut for newsgroups NG10/NG11 and NG18/NG19, where the cluster overlap is large.

Balanced case.

Page 32: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Experiments (5/5)

• This is a harder problem because of the unbalanced prior distributions.
• Mcut consistently performs better than Ncut in cases where the cluster overlaps are large.

Unbalanced case.

Page 33: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (1/9)

• We now study why Mcut consistently outperforms Ncut in large-overlap cases.

• Ncut can be written as :

Page 34: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (2/9)

* Ncut has two pronounced valleys and produces a skewed cut, while Mcut has a very flat valley and gives balanced cuts.

Page 35: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (3/9)

• Ncut produces a cutsize of 262.7, much larger than the weight W(B, B) = 169.

• In the Mcut objective, the cutsize is absent from the denominators; this leads to a balanced cut.

Page 36: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (4/9)

• Prompted by these studies, the authors provide further analysis.

• Consider the balanced cases where
• Let :

• f>0 is the average fraction of cut relative to within cluster associations.

Page 37: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (5/9)

• An example of f :
  – <W> = 100
  – cutsize(A, B) = 50
  – f = 50/100 = 0.5
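To make the role of f concrete, this sketch evaluates both objectives on the example numbers. It assumes the balanced case means W(A) = W(B) = <W>, that Ncut = cut/(W(A) + cut) + cut/(W(B) + cut), and that Mcut = cut/W(A) + cut/W(B); these forms are not spelled out in the transcript, and the function name is hypothetical.

```python
def balanced_objectives(w_avg, cut):
    """f, Ncut and Mcut for a balanced partition with W(A) = W(B) = w_avg."""
    f = cut / w_avg                     # cut relative to within-cluster association
    ncut0 = 2 * cut / (w_avg + cut)     # equals 2f / (1 + f)
    mcut0 = 2 * cut / w_avg             # equals 2f
    return f, ncut0, mcut0

f, ncut0, mcut0 = balanced_objectives(w_avg=100, cut=50)
print(f, ncut0, mcut0)
```

Under these assumptions Ncut saturates below 2 as f grows, because the cut also appears in its denominators, while Mcut grows linearly with f.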

Page 38: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (6/9)

• The corresponding value is :

• For a skewed partition A1, B1, we have W(A1) << W(B1) and therefore cut(A1, B1) << W(B1). The Ncut value is :

Because W(B1) is so large, the second term is approximately 0 and is therefore omitted.

Page 39: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (7/9)

• Using Ncut, a skewed or incorrect cut will happen if Ncut1<Ncut0 :

Page 40: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (8/9)

• Repeat the same analysis using Mcut :

• For the large-overlap case, say f = 1/2, the conditions for a possible skewed cut are :

The two conditions above are derived under the assumption that the partition is skewed.

Page 41: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Skewed cut (9/9)

• The condition for a skewed Ncut is satisfied most of the time, leading to many skewed cuts and therefore lower clustering accuracy.

• The condition for a skewed Mcut is not satisfied most of the time, leading to more correct cuts and therefore higher clustering accuracy.

Page 42: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (1/6)

• The linear search order provided by sorting the Fiedler vector q implies that nodes on one side of the cut point must belong to one cluster :

Page 43: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (2/6)

• How do we identify the nodes near the cut?
• For this purpose, the authors define the linkage l between two subgraphs :

Page 44: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (3/6)

• For a single node u, its linkage to subgraph A is :

– If u is well inside a cluster, u will have a large linkage with the cluster.

– If u is near the cut, its linkages with both clusters should be close.

• Therefore, the authors define the linkage difference :

Page 45: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (4/6)

The cut point.

• The authors move all nodes in cluster A with a negative ∆l value to cluster B if doing so lowers the objective function.

• This procedure of swapping nodes is called the “linkage-based swap.”

Page 46: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (5/6)

• The greedy move starts from the top of the list and proceeds to the last node u where s(u)∆l(u) ≥ 0.
  – s(u) = -1 if u belongs to A.
  – s(u) = 1 if u belongs to B.
• If s(u)∆l(u) < 0 but close to 0, node u is in the correct cluster, although it is close to the cut.
  – Select the 5% of the nodes with s(u)∆l(u) < 0 that have the smallest magnitude as candidates.
  – Move those that reduce the Mcut objective to the other cluster.
  – This procedure is called the "linkage-based move."
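The swap part of the refinement might be sketched as below. The linkage formulas did not survive in the transcript, so the code assumes l(u, A) = W(u, A)/W(A) and ∆l(u) = l(u, A) − l(u, B), where W(u, A) is the total weight from u into A; treat that normalization as an assumption. The 5% candidate step of the linkage-based move is not implemented, and all function names and the toy graph are invented for this example.

```python
import numpy as np

def mcut(W, in_A):
    """Assumed objective: cut(A, B)/W(A) + cut(A, B)/W(B)."""
    cut = W[np.ix_(in_A, ~in_A)].sum()
    return cut / W[np.ix_(in_A, in_A)].sum() + cut / W[np.ix_(~in_A, ~in_A)].sum()

def linkage_difference(W, in_A):
    """Assumed form: dl(u) = W(u, A)/W(A) - W(u, B)/W(B) for every node u."""
    w_a = W[np.ix_(in_A, in_A)].sum()
    w_b = W[np.ix_(~in_A, ~in_A)].sum()
    return W[:, in_A].sum(axis=1) / w_a - W[:, ~in_A].sum(axis=1) / w_b

def linkage_based_swap(W, in_A):
    """Greedily flip nodes with s(u) * dl(u) > 0 whenever the flip lowers Mcut."""
    W = np.asarray(W, dtype=float)
    in_A = np.asarray(in_A, dtype=bool).copy()
    s = np.where(in_A, -1.0, 1.0)                 # s(u) = -1 in A, +1 in B
    score = s * linkage_difference(W, in_A)       # computed once, on the initial partition
    for u in np.argsort(-score):                  # most misplaced nodes first
        if score[u] <= 0:
            break
        trial = in_A.copy()
        trial[u] = not trial[u]
        if mcut(W, trial) < mcut(W, in_A):        # keep the flip only if Mcut drops
            in_A = trial
    return in_A

# Toy graph (assumed): two triangles joined by a weak edge; node 2 starts on the wrong side.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

refined = linkage_based_swap(W, [True, True, False, False, False, False])
print(refined)
```

A fuller version would recompute ∆l after each accepted flip and guard against a cluster losing all of its internal weight.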

Page 47: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage-based refinements (6/6)

Page 48: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage differential order (1/2)

• Given the linkage differences in Figure 2, we see that quite a few nodes far away from the cut point have the wrong ∆l sign; that is, they belong to the other subgraph.
  – This strongly suggests that the Fiedler order is not necessarily the best linear search order.

Page 49: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Linkage differential order (2/2)

• We can sort linkage difference ∆l to obtain a linear order different from the Fiedler order, which will be referred to as linkage differential (LD) order.
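A sketch of searching along the LD order follows. As before, the formula ∆l(u) = W(u, A)/W(A) − W(u, B)/W(B) and the Mcut form cut/W(A) + cut/W(B) are assumptions, because the transcript lost the equations; the function names, the starting partition, and the toy graph are hypothetical.

```python
import numpy as np

def ld_order(W, in_A):
    """Node order from sorting the (assumed) linkage difference dl(u) of a partition."""
    w_a = W[np.ix_(in_A, in_A)].sum()
    w_b = W[np.ix_(~in_A, ~in_A)].sum()
    dl = W[:, in_A].sum(axis=1) / w_a - W[:, ~in_A].sum(axis=1) / w_b
    return np.argsort(dl)

def best_cut_along(W, order):
    """Linear search: the first k nodes of the order form one side; keep the lowest Mcut."""
    n = len(order)
    best_value, best_mask = np.inf, None
    for k in range(1, n):
        side = np.zeros(n, dtype=bool)
        side[order[:k]] = True
        cut = W[np.ix_(side, ~side)].sum()
        w_1 = W[np.ix_(side, side)].sum()
        w_2 = W[np.ix_(~side, ~side)].sum()
        if w_1 == 0 or w_2 == 0:        # Mcut is undefined for such splits
            continue
        value = cut / w_1 + cut / w_2
        if value < best_value:
            best_value, best_mask = value, side
    return best_value, best_mask

# Toy graph (assumed): two triangles joined by a weak edge, with an imperfect starting partition.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1
start = np.array([True, True, False, False, False, False])

value, side = best_cut_along(W, ld_order(W, start))
print(value, side)
```

Because ∆l is defined relative to an existing partition, the LD order needs a starting cut, for example the one found along the Fiedler order.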

Page 50: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

Conclusion

• 1. The authors introduce the Mcut algorithm for graph partitioning.
• 2. The min-max objective function is shown to follow the min-max clustering principle and to produce balanced partitions.

• 3. The linkage difference metric effectively identifies those nodes near the cut.

• 4. The new linkage differential order is shown to provide a better linear search order than the best known Fiedler order.