Local Sparsification for Scalable Module Identification in Networks
1
Local Sparsification for Scalable Module Identification in Networks
Srinivasan Parthasarathy
Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang
Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University
“Every 2 days we create as much information as we did up to 2003”
- Eric Schmidt, former Google CEO
The Data Deluge
2
3
$600 buys a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]
Data Storage Costs are Low
4
Data does not exist in isolation.
5
Data almost always exists in connection with other data.
6
Social networks, protein interactions, the Internet,
VLSI networks, data dependencies, neighborhood graphs
7
All this data is only useful if we can scalably extract useful knowledge
8
Challenges
1. Large Scale
Billion-edge graphs commonplace
Scalable solutions are needed
9
Challenges
2. Noise
Examples: links on the web, protein interactions
Need mechanisms to alleviate this noise
10
Challenges
3. Novel structure
Hub nodes, small world phenomena, clusters of varying densities and
sizes, directionality
Novel algorithms or techniques are needed
11
Challenges
4. Domain Specific Needs
E.g., balance, constraints, etc.
Need mechanisms to specify them
12
Challenges
5. Network Dynamics
How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?
13
Challenges
6. Cognitive Overload
Need to support guided interaction for a human in the loop
14
Our Vision and Approach
Graph Pre-processing
• Sparsification (SIGMOD ’11, WebSci ’12)
• Near Neighbor Search for non-graph data (PVLDB ’12)
• Symmetrization for directed graphs (EDBT ’10)
Core Clustering
• Consensus Clustering (KDD ’06, ISMB ’07)
• Viewpoint Neighborhood Analysis (KDD ’09)
• Graph Clustering via Stochastic Flows (KDD ’09, BCB ’10)
Dynamic Analysis and Visualization
• Event Based Analysis (KDD ’07, TKDD ’09)
• Network Visualization (KDD ’08)
• Density Plots (SIGMOD ’08, ICDE ’12)
Scalable Implementations and Systems Support on Modern Architectures
• Multicore Systems (VLDB ’07, VLDB ’09), GPUs (VLDB ’11), STI Cell (ICS ’08), Clusters (ICDM ’06, SC ’09, PPoPP ’07, ICDE ’10)
Application Domains
• Bioinformatics (ISMB ’07, ISMB ’09, ISMB ’12, ACM BCB ’11, BMC ’12)
• Social Network and Social Media Analysis (TKDD ’09, WWW ’11, WebSci ’12)
15
Graph Sparsification for Community Discovery
SIGMOD ’11, WebSci’12
16
Is there a simple pre-processing of the graph to reduce the edge set
that can “clarify” or “simplify” its cluster structure?
17
Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.
Graph Clustering and Community Discovery
18
Social Network and Graph Compression
Direct Analytics on compressed representation
Graph Clustering : Applications
19
Optimize VLSI layout
Graph Clustering : Applications
20
Protein function prediction
Graph Clustering : Applications
21
Data distribution to minimize communication and balance load
Graph Clustering : Applications
22
Is there a simple pre-processing of the graph to reduce the edge set
that can “clarify” or “simplify” its cluster structure?
23
Preview
[Figure: original graph vs. sparsified graph, automatically visualized using Prefuse]
24
The promise
Clustering algorithms can run much faster and be more accurate on a
sparsified graph.
The same holds for network visualization.
25
Utopian Objective
Retain edges which are likely to be intra-cluster edges, while
discarding likely inter-cluster edges.
26
What we need: a way to rank edges by “strength” or similarity.
27
Algorithm: Global Sparsification (G-Spar)
Parameter: Sparsification ratio, s
1. For each edge <i,j>: calculate Sim(<i,j>)
2. Retain the top s% of edges in order of Sim, discard the others
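A minimal sketch of G-Spar in Python, assuming Jaccard similarity of neighbor sets as the Sim measure (helper names are illustrative, not the paper's implementation):

```python
def jaccard(adj, i, j):
    """Jaccard similarity of the neighbor sets of i and j (one simple choice of Sim)."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def g_spar(adj, s):
    """Global sparsification (G-Spar): keep the top fraction s of all edges by similarity.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    s:   fraction of edges to retain, e.g. 0.2 for 20%.
    """
    edges = {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
    ranked = sorted(edges, key=lambda e: jaccard(adj, *e), reverse=True)
    kept = ranked[: int(s * len(ranked))]
    sparse = {i: set() for i in adj}
    for i, j in kept:
        sparse[i].add(j)
        sparse[j].add(i)
    return sparse
```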
28
Dense clusters are over-represented, sparse clusters under-represented
Works great when the goal is to just find the top communities
29
Algorithm: Local Sparsification (L-Spar)
Parameter: Sparsification exponent, e (0 < e < 1)
1. For each node i of degree d_i:
   (i) For each neighbor j: calculate Sim(<i,j>)
   (ii) Retain the top (d_i)^e neighbors of i in order of Sim
This underscores the importance of local ranking.
30
Ensures representation of clusters of varying densities
31
But...
Similarity computation is expensive!
32
A randomized, approximate solution based on
Minwise Hashing [Broder et al., 1998]
33
Minwise Hashing
Universe: { dog, cat, lion, tiger, mouse }
Ordering induced by mh1: [ cat, mouse, lion, dog, tiger ]
Ordering induced by mh2: [ lion, cat, mouse, dog, tiger ]
A = { mouse, lion }
mh1(A) = min( { mouse, lion } ) = mouse
mh2(A) = min( { mouse, lion } ) = lion
34
Key Fact
For two sets A, B, and a min-hash function mh_i(): Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Jaccard(A, B)
Unbiased estimator for Sim using k hashes: (1/k) · |{ i : mh_i(A) = mh_i(B) }|
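A minimal sketch of this estimator in Python, using salted hash functions in place of explicit random permutations (helper names such as make_minhashers are illustrative, not from the paper):

```python
import random

def make_minhashers(k, seed=0):
    """Return k hash functions; each induces a (pseudo-)random ordering of items."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(k)]
    return [lambda x, s=s: hash((s, x)) for s in seeds]

def signature(items, hashers):
    """Minhash signature of a set: the minimizing element under each hash ordering."""
    return [min(items, key=h) for h in hashers]

def estimate_sim(sig_a, sig_b):
    """Fraction of agreeing minhashes: an unbiased estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashers = make_minhashers(k=200)
A = {"dog", "mouse", "lion"}
B = {"cat", "mouse", "lion"}
print(estimate_sim(signature(A, hashers), signature(B, hashers)))  # close to 2/4 = 0.5
```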
35
Time complexity using Minwise Hashing
O(Edges × Hashes) = O(m · k)
Only 2 sequential passes over the input. Great for disk-resident data.
Note: exact similarity is less important – we really just care about relative ranking, so a smaller k suffices.
Theoretical Analysis of L-Spar: Main Results
• Q: Why choose the top d^e edges for a node of degree d?
• A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control the degree of sparsification.
• Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution, with an exponent determined by α and e.
• Corollary: The sparsification ratio corresponding to exponent e is bounded in terms of α and e.
• For α = 2.1 and e = 0.5, ~17% of the edges will be retained.
• Higher α (steeper power laws) and/or lower e leads to more sparsification.
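As a back-of-the-envelope check of the 17% figure (a sketch assuming a continuous power law p(d) = (α−1) d^(−α) with minimum degree 1, α > 2 and e < α − 1; not necessarily the paper's exact bound), the expected fraction of retained edges is

```latex
\frac{\mathbb{E}[d^{e}]}{\mathbb{E}[d]}
  = \frac{\int_{1}^{\infty} d^{e}\,(\alpha-1)\,d^{-\alpha}\,\mathrm{d}d}
         {\int_{1}^{\infty} d\,(\alpha-1)\,d^{-\alpha}\,\mathrm{d}d}
  = \frac{\alpha-2}{\alpha-1-e},
\qquad
\frac{2.1-2}{2.1-1-0.5} = \frac{0.1}{0.6} \approx 17\%.
```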
Experiments
• Datasets
  • 3 PPI networks (BioGrid, DIP, Human)
  • 2 information networks (Wiki, Flickr) & 2 social networks (Orkut, Twitter)
  • Largest network (Orkut): roughly a billion edges
  • Ground truth available for the PPI networks and Wiki
• Clustering algorithms
  • Metis [Karypis & Kumar ’98], MLR-MCL [Satuluri & Parthasarathy ’09], Metis+MQI [Lang & Rao ’04], Graclus [Dhillon et al. ’07], spectral methods [Shi ’00], edge-based agglomerative/divisive methods [Newman ’04]
• Compared sparsifications
  • L-Spar, G-Spar, RandomEdge and ForestFire
38
Results Using Metis

Dataset (n, m)           Spars. Ratio   Random          G-Spar          L-Spar
                                        Speed  Quality  Speed  Quality  Speed  Quality
Yeast_Noisy (6k, 200k)   17%            11x    -10%     30x    -15%     25x    +11%
Wiki (1.1M, 53M)         15%            8x     -26%     104x   -24%     52x    +50%
Orkut (3M, 117M)         17%            13x    +20%     30x    +60%     36x    +60%

Same sparsification ratio for all 3 methods.
Random and G-Spar: good speedups, but typically a loss in quality.
L-Spar: great speedups and quality.

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
42
L-Spar: Results Using MLR-MCL

Dataset (n, m)           Spars. Ratio   L-Spar
                                        Speed  Quality
Yeast_Noisy (6k, 200k)   17%            17x    +4%
Wiki (1.1M, 53M)         15%            23x    -4.5%
Orkut (3M, 117M)         17%            22x    0%

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
L-Spar: Qualitative Examples

Graph (Wiki article)
  Retained neighbors: Graph Theory, Adjacency list, Adjacency matrix, Model theory
  Discarded neighbors: Tessellation, Roman letters used in Mathematics, Morphism

Jack Dorsey (Twitter user and co-founder)
  Retained neighbors: Biz Stone, Evan Williams, Jason Goldman, Sarah Lacy (Twitter executives, Silicon Valley figures)
  Discarded neighbors: Alyssa Milano, JetBlue Airways, WholeFoods, Parul Sharma

Gladiator (Flickr tag)
  Retained neighbors: colosseum, world-heritage, site, italy
  Discarded neighbors: europe, travel, canon, sky, summer
44
Impact of Sparsification on Noisy Data
As the graphs get noisier, L-Spar is increasingly beneficial.
Impact of Sparsification on Spectrum: Yeast PPI
Global sparsification results in multiple components.
Local sparsification seems to match the trends of the original graph.
Impact of Sparsification on Spectrum: Epinion
Impact of Sparsification on Spectrum: Human PPI
Impact of Sparsification on Spectrum: Flickr
Anatomy of density plot
49
A density plot shows some measure of density against a specific ordering of the vertices in the graph.
Density Overlay Plots
50
Visual Comparison of Global vs. Local Sparsification
51
Summary
Sparsification: Simple pre-processing that makes a big difference
• Only tens of seconds to execute on multi-million-node graphs.
• Reduces clustering time from hours down to minutes.
• Improves accuracy by removing noisy edges for several algorithms.
• Helps visualization
• Ongoing and future work
• Spectral results suggest one might be able to provide a theoretical rationale – can we tease it out?
• Investigate other kinds of graphs, incorporating content, novel applications (e.g. wireless sensor networks, VLSI design)
52
Prior Work
• Random edge sampling [Karger ’94]
• Sampling in proportion to effective resistances: good guarantees but very slow [Spielman and Srivastava ’08]
• Matrix sparsification [Arora et al. ’06]: fast, but the same as random sampling in the absence of weights
Topological Measures
53
Modularity (from Wikipedia)
54
•Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2,1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
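In symbols, the standard form of this definition for an undirected graph with m edges is:

```latex
Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```

where A is the adjacency matrix, k_i is the degree of node i, c_i is the community of node i, and δ(c_i, c_j) is 1 when its arguments are equal and 0 otherwise.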
55
56
The MCL algorithm

Input: A, the adjacency matrix. Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I) D^-1
Repeat until converged:
  Expand: M := M*M (enhances flow to well-connected nodes, i.e. nodes within a community)
  Inflate: M := M.^r (r usually 2), then renormalize columns (increases inequality in each column, "rich get richer, poor get poorer", reducing flow across communities)
  Prune (removes entries close to zero; saves memory and enables faster convergence)
Output clusters.
Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster labels.
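A compact NumPy sketch of these steps (dense matrices and illustrative thresholds; production implementations such as MLR-MCL operate on sparse matrices and add further refinements):

```python
import numpy as np

def mcl(A, r=2, prune_threshold=1e-5, max_iter=100):
    """Basic Markov Clustering on a (dense) adjacency matrix A."""
    n = A.shape[0]
    M = A + np.eye(n)                            # add self-loops: A + I
    M = M / M.sum(axis=0, keepdims=True)         # canonical transition matrix (A + I) D^-1
    for _ in range(max_iter):
        previous = M.copy()
        M = M @ M                                # Expand: spread flow to well-connected nodes
        M = M ** r                               # Inflate: "rich get richer" within each column
        M[M < prune_threshold] = 0.0             # Prune: drop near-zero entries to save memory
        M = M / M.sum(axis=0, keepdims=True)     # renormalize columns to keep M stochastic
        if np.allclose(M, previous, atol=1e-8):  # Converged?
            break
    # Nodes that flow into the same sink (attractor) row get the same cluster label.
    return M.argmax(axis=0)

# Example: two triangles joined by a single edge should separate into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(mcl(A))  # e.g. [0 0 0 3 3 3]: two groups of three nodes
```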