scalable graph clustering using stochastic flows applications to community discovery
DESCRIPTION
Scalable Graph Clustering using Stochastic Flows Applications to Community Discovery. Venu Satuluri and Srinivasan Parthasarathy. Data Mining Research Laboratory Dept. of Computer Science and Engineering The Ohio State University. http://www.cse.ohio-state.edu/dmrl. Outline. Introduction - PowerPoint PPT PresentationTRANSCRIPT
KDD 2009
Scalable Graph Clustering using Stochastic FlowsApplications to Community Discovery
Venu Satuluri and Srinivasan Parthasarathy
Data Mining Research LaboratoryDept. of Computer Science and Engineering
The Ohio State University
http://www.cse.ohio-state.edu/dmrl
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Outline
• Introduction- Problem Statement- Markov Clustering (MCL)• Proposed Algorithms- Regularized MCL (R-MCL)- Multi-level Regularized MCL (MLR-MCL)• Evaluation• Conclusions
2
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Problem Statement
Graph Clustering: Partition the vertices of a graph into disjoint sets such that each partition is a well-connected/coherent group.
Applications: • Discovery of protein complexes [Snel ‘02]• Community discovery in social networks [Newman ‘06]•Image segmentation [Shi ‘00]Existing solutions:• Spectral methods [Shi ‘00]• Edge-based agglomerative/divisive methods [Newman ‘04]• Kernel K-Means [Dhillon ‘07]• Metis [Karypis ’98]• Markov Clustering (MCL) [van Dongen ’00]
3
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Markov Clustering (MCL) [van Dongen ‘00]
The original algorithm for clustering graphs using stochastic flows.
Advantages:• Simple and elegant.• Widely used in Bioinformatics because of its noise tolerance and effectiveness.
Disadvantages:• Very slow.
- Takes 1.2 hours to cluster a 76K node social network.• Prone to output too many clusters.
- Produces 1416 clusters on a 4741 node PPI network.
Can we redress the disadvantages of MCL while retaining its advantages?
4
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Terminology
• Flow: Transition probability from a node to another node.• Flow matrix: Matrix with the flows among all nodes; ith
column represents flows out of ith node. Each column sums to 1.
1 2 3
1 2 3
0.5 0.5
1 1
1 2 3
1 0 0.5 0
2 1.0 0 1.0
3 0 0.5 0
Flow
Matrix
5
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
The MCL algorithm
Expand: M := M*M
Inflate: M := M.^r (r usually 2), renormalize columns
Converged?
Input: A, Adjacency matrixInitialize M to MG, the canonical transition matrix M:= MG:= (A+I) D-1
Yes
Output clusters
No
Prune
Enhances flow to well-connected nodes as well as to new nodes.
Increases inequality in each column. “Rich get richer, poor get poorer.”
Saves memory by removing entries close to zero.
6
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
The Regularize operator
Why does MCL output many clusters?The original matrix is only used at the start, and neighboring information fades as time goes on.Called “overfitting”; it does not penalize divergence of flows between neighbors.
Remedy: Let qi, i=1:k, be the flow distributions of the k neighbors of node q in the graph. Let wi, i=1:k, be the respective normalized edge weights, flow of q:
Closed solution:
This update defines the Regularize operator. In matrix notation,Regularize(M) := M*MG
= M*(A+I)D-1
7
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
The Regularized-MCL algorithm
Regularize: M := M*MG
Inflate: M := M.^r (r usually 2), renormalize columns
Converged?
Yes
Output clusters
No
Prune
Takes into account flows of the neighbors.
Increases inequality in each column. “Rich get richer, poor get poorer.”
Saves memory by removing entries close to zero.
Input: A, Adjacency matrixInitialize M to MG, the canonical transition matrix M:= MG:= (A+I) D-1
8
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Multi-level Regularized MCL
Input Graph
Intermediate Graph
Intermediate Graph
Coarsest Graph
. . . . . .
Coarsen
Coarsen
Coarsen
Run Curtailed R-MCL,project flow.
Run Curtailed R-MCL, project flow.
Input Graph
Run R-MCL to convergence, output clusters.
Faster to run on smaller graphs
first
Captures global topology of
graph
Initializes flow matrix of refined
graph
9
Coarsening operation• Construct a matching: defined as a set of edges, no
vertex is shared among these edges. • Each edge is mapped into a super-node in the
coarsened graph, and the new edges are the union of the original ones.
• Two maps used to keep the track of the process
10
1
2 3
4
5
6 1 4
2 3
5 6
matching mappingA B C
Map1: 1, 2, 5Map1: 4, 3, 6
Map1: A B C
Project flow
11
Evaluation criteria
• The normalized cut of a cluster C in the graph G is defined as:
• Average Ncut:
12
/kAvg.
Comparison with MCL
• Why R-MCL is much faster than MCL?– Regularize is more faster that expansion, because MG
is sparser than M, and R-MCL can stop earlier
• It seems MLR-MCL only upgrades the performance very less, especially the AVG. N-cut
13
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Comparison with Graclus and Metis
Quality: MLR-MCL improves upon both Graclus and Metis14
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Comparison with Graclus and Metis
Speed: MLR-MCL is faster than Graclus and competitive with Metis15
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Evaluation on PPI networks
Yeast PPI network with 4741 proteins and 15148 interactions. Annotations from the Gene Ontology database used as ground truth.
MLR-MCL returns clusters of higher biological significance
than MCL or Graclus.
16
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Conclusions
• Regularized MCL overcomes the fragmentation problem of MCL.
• Multi-level Regularized MCL further improves quality and speed of R-MCL.
• MLR-MCL often outperforms state-of-the-art algorithms, both quality and speed-wise, on a wide variety of real datasets.
Future Directions:• Novel coarsening strategies• Extensions to directed and bi-partite graphs.
Acknowledgements: This work is supported in part by the following grants: NSF CAREER
IIS-0347662, RI-CNS-0403342, CCF-0702586 and IIS-0742999
17
Scalable Graph Clustering using Stochastic FlowsVenu Satuluri and Srinivasan Parthasarathy
Thank You!
References:1. MCL - Graph Clustering by Flow Simulation. S. van Dongen, Ph.D. thesis,
University of Utrecht, 2000.2. Graclus - Weighted Graph Cuts without Eigenvectors: A Multilevel Approach.
Dhillon et. al., IEEE. Trans. PAMI, 2007.3. Metis - A fast and high quality multilevel scheme for partitioning irregular
graphs. Karypis and Kumar, SIAM J. on Scientific Computing, 19984. Normalized Cuts and Image Segmentation. Shi and Malik, IEEE. Trans. PAMI,
2000.5. Finding and evaluating community structure in networks. Newman and Girvan,
Phys. Rev. E 69, 2004. 6. The identification of functional modules from the genomic association of genes.
Snel et. al., PNAS 2002.
18