![Page 1: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/1.jpg)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
![Page 2: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/2.jpg)
• We often think of networks as being organized
into modules, clusters, communities:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2
![Page 3: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/3.jpg)
![Page 4: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/4.jpg)
• Find micro-markets by partitioning the query-to-advertiser graph:
[Figure: query-to-advertiser graph; axes: advertiser, query]
[Andersen, Lang: Communities from seed sets, 2006]
![Page 5: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/5.jpg)
• Clusters in Movies-to-Actors graph:
[Andersen, Lang: Communities from seed sets, 2006]
![Page 6: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/6.jpg)
• Discovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
![Page 7: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/7.jpg)
How to find communities?
We will work with undirected (unweighted) networks
![Page 8: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/8.jpg)
• Edge betweenness: number of shortest paths passing over the edge
• Intuition:
[Figure: edge strengths (call volume) in a real network, next to edge betweenness in a real network; example edges labeled b=16 and b=7.5]
![Page 9: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/9.jpg)
• Divisive hierarchical clustering based on the notion of edge betweenness:
  -- Number of shortest paths passing through the edge
• Girvan-Newman Algorithm:
  -- Undirected unweighted networks
  -- Repeat until no edges are left:
     1. Calculate betweenness of edges
     2. Remove edges with highest betweenness
  -- Connected components are communities
  -- Gives a hierarchical decomposition of the network
[Girvan-Newman ‘02]
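The loop above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`edge_betweenness`, `components`, `girvan_newman`) and the two-triangle example graph are assumptions made for this sketch, and ties for the highest betweenness are all removed in the same round.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph (adj: node -> set of neighbors)."""
    score = {}
    for root in adj:
        # BFS from root: distances, shortest-path counts, visiting order
        dist, paths, order = {root: 0}, {root: 1}, []
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
        # Work back up the BFS levels, splitting each node's flow among its parents
        flow = {u: 1.0 for u in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = flow[v] * paths[u] / paths[v]
                    edge = frozenset((u, v))
                    score[edge] = score.get(edge, 0.0) + share
                    flow[u] += share
    # Every shortest path was seen from both of its endpoints
    return {edge: s / 2 for edge, s in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman(edges):
    """Return the successive partitions produced while removing edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels = [components(adj)]
    while any(adj.values()):                     # repeat until no edges are left
        score = edge_betweenness(adj)            # re-computed at every step
        top = max(score.values())
        for edge, s in score.items():
            if s >= top - 1e-9:                  # remove all edges tied for the maximum
                u, v = tuple(edge)
                adj[u].discard(v)
                adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):         # record only rounds that split something
            levels.append(parts)
    return levels


# Hypothetical example: two triangles joined by the bridge 3-4
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
levels = girvan_newman(edges)
for parts in levels:
    print(sorted(sorted(p) for p in parts))
```

On this example the bridge 3-4 carries all 9 shortest paths between the two triangles, so it is removed first and the two triangles appear as the first pair of communities.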
![Page 10: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/10.jpg)
Need to re-compute betweenness at every step
[Figure: example network with edge betweenness labels 49, 33, 12, 1]
![Page 11: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/11.jpg)
[Figure: the network after Step 1, Step 2, and Step 3, and the resulting hierarchical network decomposition]
![Page 12: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/12.jpg)
Communities in physics collaborations
![Page 13: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/13.jpg)
• Zachary’s Karate club:
Hierarchical decomposition
![Page 14: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/14.jpg)
1. How to compute betweenness?
2. How to select the number of
clusters?
![Page 15: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/15.jpg)
• Want to compute betweenness of paths starting at node $A$
• Breadth-first search starting from $A$:
[Figure: BFS tree rooted at $A$, with levels 0 to 4]
![Page 16: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/16.jpg)
• Count the number of shortest paths from $A$ to all other nodes of the network:
![Page 17: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/17.jpg)
• Compute betweenness by working up the tree: if there are multiple paths, count them fractionally
The algorithm:
• Add edge flows:
  -- node flow = 1 + ∑ child edges
  -- split the flow up based on the parent value
• Repeat the BFS procedure for each starting node
[Figure annotations: 1 path to K, split evenly; 1+0.5 paths to J, split 1:2; 1+1 paths to H, split evenly]
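The two steps for a single starting node (count shortest paths going down, add edge flows going up) can be sketched as follows. The function names and the small diamond-shaped example graph are hypothetical, chosen only for this illustration; summing over all starting nodes and halving assumes each shortest path should be counted once per pair of endpoints.

```python
from collections import deque


def bfs_edge_credit(adj, root):
    """Edge flows for shortest paths that start at root.

    adj maps each node to the set of its neighbors.
    Returns (paths, credit): paths[v] is the number of shortest paths
    from root to v, credit[frozenset((u, v))] is the flow on edge u-v.
    """
    dist, paths, order = {root: 0}, {root: 1}, []
    queue = deque([root])
    while queue:                      # step 1: BFS, counting shortest paths
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    flow = {u: 1.0 for u in order}    # node flow starts at 1, child edges are added below
    credit = {}
    for v in reversed(order):         # step 2: work up the tree, deepest nodes first
        for u in adj[v]:
            if dist[u] == dist[v] - 1:                  # u is a parent of v
                share = flow[v] * paths[u] / paths[v]   # split by the parents' path counts
                credit[frozenset((u, v))] = share
                flow[u] += share
    return paths, credit


def edge_betweenness(adj):
    """Sum the flows over every starting node; each path is seen from both ends."""
    total = {}
    for root in adj:
        for edge, share in bfs_edge_credit(adj, root)[1].items():
            total[edge] = total.get(edge, 0.0) + share
    return {edge: value / 2 for edge, value in total.items()}


# Hypothetical example: a diamond A-B-D, A-C-D with a tail D-E
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
       "D": {"B", "C", "E"}, "E": {"D"}}
paths, credit = bfs_edge_credit(adj, "A")
```

From `A`, node `D` is reached by 2 shortest paths, so its flow of 1+1 (itself plus the edge to `E`) is split evenly between its parents `B` and `C`.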
![Page 18: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/18.jpg)
![Page 19: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/19.jpg)
1. How to compute betweenness?
2. How to select the number of
clusters?
![Page 20: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/20.jpg)
• Communities: sets of tightly connected nodes
• Define: Modularity $Q$
  -- A measure of how well a network is partitioned into communities
  -- Given a partitioning of the network into groups $s \in S$:
     $Q \propto \sum_{s \in S} \left[ (\text{\# edges within group } s) - (\text{expected \# edges within group } s) \right]$
Need a null model!
![Page 21: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/21.jpg)
• Given real $G$ on $n$ nodes and $m$ edges, construct rewired network $G'$
  -- Same degree distribution but random connections
  -- Consider $G'$ as a multigraph
  -- The expected number of edges between nodes $i$ and $j$ of degrees $k_i$ and $k_j$ equals: $k_i \cdot \frac{k_j}{2m} = \frac{k_i k_j}{2m}$
  -- The expected number of edges in (multigraph) $G'$:
     $\frac{1}{2} \sum_{i \in N} \sum_{j \in N} \frac{k_i k_j}{2m} = \frac{1}{2} \cdot \frac{1}{2m} \sum_{i \in N} k_i \left( \sum_{j \in N} k_j \right) = \frac{1}{4m} \cdot 2m \cdot 2m = m$
Note: $\sum_{u \in N} k_u = 2m$
![Page 22: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/22.jpg)
• Modularity of partitioning $S$ of graph $G$:
  -- $Q \propto \sum_{s \in S} \left[ (\text{\# edges within group } s) - (\text{expected \# edges within group } s) \right]$
  -- $Q(G, S) = \frac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$
     where $A_{ij} = 1$ if $i \to j$, 0 else; $\frac{1}{2m}$ is the normalizing constant, so that $-1 < Q < 1$
• Modularity values take range $[-1, 1]$
  -- It is positive if the number of edges within groups exceeds the expected number
  -- $Q$ greater than 0.3-0.7 means significant community structure
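As a rough sketch of the formula for $Q(G, S)$, the function below evaluates it for a simple undirected graph given as an edge list. The name `modularity` and the two-triangle example are assumptions for illustration; the code uses the fact that $\sum_{i,j \in s} A_{ij}$ counts every within-group edge twice and that $\sum_{i,j \in s} k_i k_j = (\sum_{i \in s} k_i)^2$.

```python
def modularity(edges, groups):
    """Q(G, S) for an undirected simple graph; groups is a list of disjoint node sets."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    group_of = {node: g for g, members in enumerate(groups) for node in members}
    # sum over i, j in s of A_ij: each within-group edge is counted twice
    within = [0] * len(groups)
    for u, v in edges:
        if group_of[u] == group_of[v]:
            within[group_of[u]] += 2
    q = 0.0
    for g, members in enumerate(groups):
        total_degree = sum(degree.get(node, 0) for node in members)
        # sum over i, j in s of k_i * k_j / 2m
        q += within[g] - total_degree ** 2 / (2 * m)
    return q / (2 * m)


# Hypothetical example: two triangles joined by the bridge 3-4
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
q_split = modularity(edges, [{1, 2, 3}, {4, 5, 6}])
q_single = modularity(edges, [{1, 2, 3, 4, 5, 6}])
print(q_split, q_single)
```

Splitting into the two triangles gives $Q = 5/14 \approx 0.36$, while putting every node in one group gives $Q = 0$, since the group then contains exactly as many edges as the null model expects.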
![Page 23: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/23.jpg)
• Modularity is useful for selecting the number of clusters:
[Figure: plot of modularity $Q$]
Next time: Why not optimize Modularity directly?
![Page 24: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/24.jpg)
![Page 25: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/25.jpg)
• Undirected graph $G(V, E)$:
• Bi-partitioning task:
  -- Divide vertices into two disjoint groups $A, B$
• Questions:
  -- How can we define a “good” partition of $G$?
  -- How can we efficiently identify such a partition?
[Figure: graph on nodes 1 to 6, shown unpartitioned and partitioned into groups A and B]
![Page 26: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/26.jpg)
• What makes a good partition?
  -- Maximize the number of within-group connections
  -- Minimize the number of between-group connections
[Figure: graph on nodes 1 to 6 partitioned into groups A and B]
![Page 27: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/27.jpg)
• Express partitioning objectives as a function of the “edge cut” of the partition
• Cut: set of edges with only one vertex in a group: $cut(A, B) = \sum_{i \in A, j \in B} w_{ij}$
[Figure: graph on nodes 1 to 6 partitioned into groups A and B, with cut(A,B) = 2]
![Page 28: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/28.jpg)
• Criterion: Minimum-cut
  -- Minimize weight of connections between groups: $\arg\min_{A,B} cut(A, B)$
• Degenerate case:
  [Figure: a graph with the “optimal cut” and the minimum cut marked]
• Problem:
  -- Only considers external cluster connections
  -- Does not consider internal cluster connectivity
![Page 29: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/29.jpg)
• Criterion: Normalized-cut [Shi-Malik, ’97]
  -- Connectivity between groups relative to the density of each group:
     $ncut(A, B) = \frac{cut(A, B)}{vol(A)} + \frac{cut(A, B)}{vol(B)}$
  -- $vol(A)$: total weight of the edges with at least one endpoint in $A$: $vol(A) = \sum_{i \in A} k_i$
• Why use this criterion?
  -- Produces more balanced partitions
• How do we efficiently find a good partition?
  -- Problem: Computing optimal cut is NP-hard
[Shi-Malik]
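To see why the normalized criterion favors balanced partitions, here is a small sketch that evaluates both the cut and the normalized cut. It assumes unit edge weights and the Shi-Malik form $ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)$ with $vol$ taken as the sum of degrees; the function names are illustrative, and the edge list is the 6-node example graph whose adjacency matrix appears later in these slides.

```python
def cut(edges, a, b):
    """Number of edges with one endpoint in a and the other in b (unit weights)."""
    return sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))


def vol(edges, group):
    """Sum of the degrees of the nodes in group."""
    return sum((u in group) + (v in group) for u, v in edges)


def ncut(edges, a, b):
    """Normalized cut: the cut relative to the volume of each side."""
    c = cut(edges, a, b)
    return c / vol(edges, a) + c / vol(edges, b)


# The 6-node example graph of the slides
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
balanced = ({1, 2, 3}, {4, 5, 6})
lopsided = ({2}, {1, 3, 4, 5, 6})
for a, b in (balanced, lopsided):
    print(cut(edges, a, b), ncut(edges, a, b))
```

Both partitions cut 2 edges, so minimum cut cannot tell them apart, but the normalized cut is 0.5 for the balanced split and about 1.14 when node 2 is split off alone.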
![Page 30: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/30.jpg)
• $A$: adjacency matrix of undirected $G$
  -- $A_{ij} = 1$ if $(i, j)$ is an edge, else 0
• $x$ is a vector in $\mathbb{R}^n$ with components $(x_1, \ldots, x_n)$
  -- Think of it as a label/value of each node of $G$
• What is the meaning of $A \cdot x$?
  -- With $y = A \cdot x$: $y_i = \sum_{j=1}^{n} A_{ij} x_j = \sum_{(i,j) \in E} x_j$
  -- Entry $y_i$ is a sum of labels $x_j$ of neighbors of $i$
![Page 31: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/31.jpg)
• $j$th coordinate of $A \cdot x$:
  -- Sum of the $x$-values of neighbors of $j$
  -- Make this a new value at node $j$
• Spectral Graph Theory:
  -- Analyze the “spectrum” of the matrix representing $G$
  -- Spectrum: eigenvectors $x_i$ of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues $\lambda_i$
$A \cdot x = \lambda \cdot x$
![Page 32: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/32.jpg)
• Suppose all nodes in $G$ have degree $d$ and $G$ is connected
• What are some eigenvalues/vectors of $G$?
  $A \cdot x = \lambda \cdot x$. What is $\lambda$? What is $x$?
  -- Let’s try: $x = (1, 1, \ldots, 1)$
  -- Then: $A \cdot x = (d, d, \ldots, d) = \lambda \cdot x$. So: $\lambda = d$
  -- We found an eigenpair of $G$: $x = (1, 1, \ldots, 1)$, $\lambda = d$
Remember the meaning of $y = A \cdot x$: entry $y_i$ is the sum of the labels $x_j$ of the neighbors of $i$
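A quick way to check this eigenpair is to apply the "sum of neighbor labels" rule directly. The helper name `neighbor_sums` and the 4-cycle example (a 2-regular graph) are assumptions for this sketch.

```python
def neighbor_sums(adj, x):
    """Return y = A·x: y[i] is the sum of the labels x[j] over the neighbors j of i."""
    return {i: sum(x[j] for j in adj[i]) for i in adj}


# Hypothetical example: a 4-cycle, so every node has degree d = 2
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}

ones = {i: 1 for i in cycle}
y = neighbor_sums(cycle, ones)             # every entry equals d = 2

alternating = {0: 1, 1: -1, 2: 1, 3: -1}
z = neighbor_sums(cycle, alternating)      # equals -2 times the input labels
print(y, z)
```

The all-ones labeling is mapped to $d$ times itself, so $\lambda = d = 2$. The alternating labeling is also an eigenvector of this particular graph, but with the smaller eigenvalue $-2$.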
![Page 33: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/33.jpg)
• $G$ is $d$-regular connected, $A$ is its adjacency matrix
• Claim:
  -- $d$ is the largest eigenvalue of $A$
  -- $d$ has multiplicity of 1 (there is only 1 eigenvector associated with eigenvalue $d$)
• Proof: Why no eigenvalue $d' > d$?
  -- To obtain $d$ we needed $x_i = x_j$ for every $i, j$
  -- This means $x = c \cdot (1, 1, \ldots, 1)$ for some const. $c$
  -- Define: $S$ = nodes $i$ with maximum possible value of $x_i$
  -- Then consider some vector $y$ which is not a multiple of vector $(1, \ldots, 1)$. So not all nodes $i$ (with labels $y_i$) are in $S$
  -- Consider some node $j \in S$ and a neighbor $i \notin S$; then node $j$ gets a value strictly less than $d \cdot y_j$
  -- So $y$ is not an eigenvector! And so $d$ is the largest eigenvalue!
Details!
![Page 34: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/34.jpg)
• What if $G$ is not connected?
  -- $G$ has 2 components, each $d$-regular
• What are some eigenvectors?
  -- $x$ = put all 1s on $A$ and 0s on $B$, or vice versa
  -- $x' = (1, \ldots, 1, 0, \ldots, 0)$ then $A \cdot x' = (d, \ldots, d, 0, \ldots, 0)$
  -- $x'' = (0, \ldots, 0, 1, \ldots, 1)$ then $A \cdot x'' = (0, \ldots, 0, d, \ldots, d)$
  -- And so in both cases the corresponding $\lambda = d$
• A bit of intuition:
  [Figure: two disconnected components A and B (of sizes |A| and |B|), for which $\lambda_n = \lambda_{n-1}$; next to it, A and B joined by a few edges, for which $\lambda_n - \lambda_{n-1} \approx 0$: the 2nd largest eigenvalue $\lambda_{n-1}$ now has a value very close to $\lambda_n$]
![Page 35: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/35.jpg)
• More intuition:
  -- If the graph is connected (right example) then we already know that $x_n = (1, \ldots, 1)$ is an eigenvector
  -- Since eigenvectors are orthogonal, the components of $x_{n-1}$ sum to 0
     Why? Because $x_n \cdot x_{n-1} = \sum_i x_n[i] \cdot x_{n-1}[i] = \sum_i x_{n-1}[i] = 0$
  -- So we can look at the eigenvector of the 2nd largest eigenvalue and declare nodes with positive label in A and negative label in B
  -- But there is still lots to sort out
[Figure: left, disconnected groups A and B with $\lambda_n = \lambda_{n-1}$; right, A and B joined by a few edges with $\lambda_n - \lambda_{n-1} \approx 0$, where the 2nd largest eigenvalue $\lambda_{n-1}$ has a value very close to $\lambda_n$]
![Page 36: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/36.jpg)
• Adjacency matrix ($A$):
  -- $n \times n$ matrix
  -- $A = [a_{ij}]$, $a_{ij} = 1$ if there is an edge between nodes $i$ and $j$
• Important properties:
  -- Symmetric matrix
  -- Eigenvectors are real and orthogonal
[Figure: example graph on nodes 1 to 6; its adjacency matrix:]
1 2 3 4 5 6
1 0 1 1 0 1 0
2 1 0 1 0 0 0
3 1 1 0 1 0 0
4 0 0 1 0 1 1
5 1 0 0 1 0 1
6 0 0 0 1 1 0
![Page 37: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/37.jpg)
• Degree matrix ($D$):
  -- $n \times n$ diagonal matrix
  -- $D = [d_{ii}]$, $d_{ii}$ = degree of node $i$
[Figure: example graph on nodes 1 to 6; its degree matrix:]
1 2 3 4 5 6
1 3 0 0 0 0 0
2 0 2 0 0 0 0
3 0 0 3 0 0 0
4 0 0 0 3 0 0
5 0 0 0 0 3 0
6 0 0 0 0 0 2
![Page 38: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/38.jpg)
• Laplacian matrix ($L$): $L = D - A$
  -- $n \times n$ symmetric matrix
• What is the trivial eigenpair?
  -- $x = (1, \ldots, 1)$ then $L \cdot x = 0$ and so $\lambda = \lambda_1 = 0$
• Important properties:
  -- Eigenvalues are non-negative real numbers
  -- Eigenvectors are real and orthogonal
[Figure: example graph on nodes 1 to 6; its Laplacian matrix:]
1 2 3 4 5 6
1 3 -1 -1 0 -1 0
2 -1 2 -1 0 0 0
3 -1 -1 3 -1 0 0
4 0 0 -1 3 -1 -1
5 -1 0 0 -1 3 -1
6 0 0 0 -1 -1 2
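The matrix above can be rebuilt from the edge list of the example graph. This is a minimal sketch; the function name `laplacian` and the list-of-rows representation are choices made here, with the edges read off the adjacency matrix shown earlier.

```python
def laplacian(n, edges):
    """Return L = D - A as a list of rows for nodes 1..n (row 0 is node 1)."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][i] += 1    # D: each edge adds one to the degree of both endpoints
        L[j][j] += 1
        L[i][j] -= 1    # -A: off-diagonal entries for the edge
        L[j][i] -= 1
    return L


# Edges of the 6-node example graph
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
L = laplacian(6, edges)
for row in L:
    print(row)

# Trivial eigenpair: L · (1, ..., 1) = 0, because every row sums to zero
row_sums = [sum(row) for row in L]
```

Every row sums to zero because the degree on the diagonal is cancelled by one $-1$ per neighbor, which is exactly the statement $L \cdot (1, \ldots, 1) = 0$.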
![Page 39: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/39.jpg)
(a) All eigenvalues are $\geq 0$
(b) $x^T L x = \sum_{ij} L_{ij} x_i x_j \geq 0$ for every $x$
(c) $L = N^T \cdot N$
  -- That is, $L$ is positive semi-definite
• Proof:
  -- (c) ⇒ (b): $x^T L x = x^T N^T N x = (N x)^T (N x) \geq 0$
     As it is just the square of the length of $N x$
  -- (b) ⇒ (a): Let $\lambda$ be an eigenvalue of $L$ with eigenvector $x$. Then by (b) $x^T L x \geq 0$, so $x^T L x = x^T \lambda x = \lambda x^T x \Rightarrow \lambda \geq 0$
  -- (a) ⇒ (c): is also easy! Do it yourself.
Details!
![Page 40: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/40.jpg)
• Fact: For symmetric matrix $M$:
  $\lambda_2 = \min_x \frac{x^T M x}{x^T x}$ (minimum over $x$ orthogonal to the first eigenvector)
• What is the meaning of $\min x^T L x$ on $G$?
  $x^T L x = \sum_{i,j=1}^{n} L_{ij} x_i x_j = \sum_{i,j=1}^{n} (D_{ij} - A_{ij}) x_i x_j$
  $= \sum_i D_{ii} x_i^2 - \sum_{(i,j) \in E} 2 x_i x_j$
  $= \sum_{(i,j) \in E} (x_i^2 + x_j^2 - 2 x_i x_j) = \sum_{(i,j) \in E} (x_i - x_j)^2$
Node $i$ has degree $d_i$. So, value $x_i^2$ needs to be summed up $d_i$ times.
But each edge $(i, j)$ has two endpoints so we need $x_i^2 + x_j^2$
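The identity $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$ can be checked numerically on the 6-node example graph. The helper names below are illustrative, and the labeling `x` is one chosen for this sketch: it sums to zero and happens to satisfy $L x = x$.

```python
# Laplacian and edge list of the 6-node example graph
L = [
    [3, -1, -1, 0, -1, 0],
    [-1, 2, -1, 0, 0, 0],
    [-1, -1, 3, -1, 0, 0],
    [0, 0, -1, 3, -1, -1],
    [-1, 0, 0, -1, 3, -1],
    [0, 0, 0, -1, -1, 2],
]
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]


def quadratic_form(L, x):
    """x^T L x, written out as the double sum over all i, j."""
    n = len(x)
    return sum(L[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def edge_sum(edges, x):
    """Sum of squared label differences over the edges (nodes are numbered from 1)."""
    return sum((x[u - 1] - x[v - 1]) ** 2 for u, v in edges)


x = [1, 2, 1, -1, -1, -2]          # labels of nodes 1..6, summing to zero
lhs = quadratic_form(L, x)
rhs = edge_sum(edges, x)
ratio = lhs / sum(xi * xi for xi in x)
print(lhs, rhs, ratio)
```

Both sides evaluate to 12, and dividing by $x^T x = 12$ gives the ratio 1, so for this labeling the quotient $\frac{x^T L x}{x^T x}$ equals 1.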
![Page 41: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/41.jpg)
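• Write $x$ in axes of eigenvectors $w_1, w_2, \ldots, w_n$ of $M$. So, $x = \sum_i \alpha_i w_i$
• Then we get: $M x = \sum_i \alpha_i M w_i = \sum_i \alpha_i \lambda_i w_i$
• So, what is $x^T M x$?
  $x^T M x = \left( \sum_i \alpha_i w_i \right) \left( \sum_i \alpha_i \lambda_i w_i \right) = \sum_{ij} \alpha_i \lambda_j \alpha_j w_i w_j = \sum_i \alpha_i \lambda_i \alpha_i w_i w_i = \sum_i \lambda_i \alpha_i^2$
  using $w_i w_j = 0$ if $i \neq j$, 1 otherwise
• To minimize this over all unit vectors $x$ orthogonal to $w_1$:
  -- min over choices of $(\alpha_1, \ldots, \alpha_n)$ so that: $\sum_i \alpha_i^2 = 1$ (unit length) and $\alpha_1 = 0$ (orthogonal to $w_1$)
  -- To minimize this, set $\alpha_2 = 1$ and so $\sum_i \lambda_i \alpha_i^2 = \lambda_2$
Hence: $\lambda_2 = \min_x \frac{x^T M x}{x^T x}$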
Details!
![Page 42: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/42.jpg)
• What else do we know about $x$?
  -- $x$ is a unit vector: $\sum_i x_i^2 = 1$
  -- $x$ is orthogonal to the 1st eigenvector $(1, \ldots, 1)$, thus: $\sum_i x_i \cdot 1 = \sum_i x_i = 0$
• Remember:
  $\lambda_2 = \min \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_i x_i^2}$, where the minimum is over all labelings of nodes $i$ so that $\sum_i x_i = 0$
We want to assign values $x_i$ to nodes $i$ such that few edges cross 0
(we want $x_i$ and $x_j$ to subtract each other).
[Figure: nodes placed on a line by their values, $x_i$ on one side of 0 and $x_j$ on the other; balance to minimize]
![Page 43: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/43.jpg)
• Back to finding the optimal cut
• Express partition (A,B) as a vector: $y_i = +1$ if $i \in A$, $y_i = -1$ if $i \in B$
• We can minimize the cut of the partition by finding a non-trivial vector $y$ that minimizes:
  $f(y) = \sum_{(i,j) \in E} (y_i - y_j)^2$
Can’t solve exactly. Let’s relax $y$ and allow it to take any real value.
[Figure: number line with $y_i = -1$, 0, and $y_j = +1$]
![Page 44: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/44.jpg)
• $\lambda_2 = \min_y f(y)$: The minimum value of $f(y)$ is given by the 2nd smallest eigenvalue $\lambda_2$ of the Laplacian matrix $L$
• $x = \arg\min_y f(y)$: The optimal solution for $y$ is given by the corresponding eigenvector $x$, referred to as the Fiedler vector
[Figure: number line with $x_i$ and $x_j$ on either side of 0]
![Page 45: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/45.jpg)
• Suppose there is a partition of $G$ into $A$ and $B$ where $|A| \leq |B|$, s.t. $\alpha = \frac{\#\text{ edges from } A \text{ to } B}{|A|}$; then $2\alpha \geq \lambda_2$
  -- This is the approximation guarantee of spectral clustering. It says the cut spectral finds is at most a factor of 2 away from the optimal one of score $\alpha$.
• Proof:
  -- Let: $a = |A|$, $b = |B|$ and $e$ = # edges from $A$ to $B$
  -- Enough to choose some $x_i$ based on $A$ and $B$ such that: $\lambda_2 \leq \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_i x_i^2} \leq 2\alpha$ (while also $\sum_i x_i = 0$)
     ($\lambda_2$ is only smaller)
Details!
![Page 46: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/46.jpg)
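• Proof (continued):
  -- 1) Let’s set: $x_i = -\frac{1}{a}$ if $i \in A$, $x_i = +\frac{1}{b}$ if $i \in B$
     Let’s quickly verify that $\sum_i x_i = 0$: $a \left( -\frac{1}{a} \right) + b \left( \frac{1}{b} \right) = 0$
  -- 2) Then: $\frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_i x_i^2} = \frac{\sum_{i \in A, j \in B} \left( \frac{1}{b} + \frac{1}{a} \right)^2}{a \left( -\frac{1}{a} \right)^2 + b \left( \frac{1}{b} \right)^2} = \frac{e \cdot \left( \frac{1}{a} + \frac{1}{b} \right)^2}{\frac{1}{a} + \frac{1}{b}}$
     $= e \left( \frac{1}{a} + \frac{1}{b} \right) \leq e \left( \frac{1}{a} + \frac{1}{a} \right) = e \cdot \frac{2}{a} = 2\alpha$
Details!
Which proves that the cost achieved by spectral is better than twice the OPT cost
($e$ … number of edges between $A$ and $B$)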
![Page 47: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/47.jpg)
• Putting it all together: $2\alpha \geq \lambda_2 \geq \frac{\alpha^2}{2 k_{max}}$
  -- where $k_{max}$ is the maximum node degree in the graph
  -- Note we only provide the 1st part: $2\alpha \geq \lambda_2$
  -- We did not prove $\lambda_2 \geq \frac{\alpha^2}{2 k_{max}}$
• Overall this certifies that $\lambda_2$ always gives a useful bound
Details!
![Page 48: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/48.jpg)
• How to define a “good” partition of a graph?
  -- Minimize a given graph cut criterion
• How to efficiently identify such a partition?
  -- Approximate using information provided by the eigenvalues and eigenvectors of a graph
• Spectral Clustering
![Page 49: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/49.jpg)
• Three basic stages:
  -- 1) Pre-processing
     Construct a matrix representation of the graph
  -- 2) Decomposition
     Compute eigenvalues and eigenvectors of the matrix
     Map each point to a lower-dimensional representation based on one or more eigenvectors
  -- 3) Grouping
     Assign points to two or more clusters, based on the new representation
![Page 50: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/50.jpg)
• 1) Pre-processing:
  -- Build Laplacian matrix $L$ of the graph
1 2 3 4 5 6
1 3 -1 -1 0 -1 0
2 -1 2 -1 0 0 0
3 -1 -1 3 -1 0 0
4 0 0 -1 3 -1 -1
5 -1 0 0 -1 3 -1
6 0 0 0 -1 -1 2
• 2) Decomposition:
  -- Find eigenvalues $\lambda$ and eigenvectors $x$ of the matrix $L$
$\lambda$ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)
$X$ = (one row per node 1 to 6, one column per eigenvector, in the order of $\lambda$)
1 0.4 0.3 -0.5 -0.2 -0.4 -0.5
2 0.4 0.6 0.4 -0.4 0.4 0.0
3 0.4 0.3 0.1 0.6 -0.4 0.5
4 0.4 -0.3 0.1 0.6 0.4 -0.5
5 0.4 -0.3 -0.5 -0.2 0.4 0.5
6 0.4 -0.6 0.4 -0.4 -0.4 0.0
  -- Map vertices to corresponding components of $x_2$, the eigenvector of $\lambda_2$
[Figure: components of $x_2$ for nodes 1 to 6: 0.31, 0.62, 0.33, -0.34, -0.35, -0.66]
How do we now find the clusters?
![Page 51: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/51.jpg)
• 3) Grouping:
  • Sort components of the reduced 1-dimensional vector
  • Identify clusters by splitting the sorted vector in two
• How to choose a splitting point?
  • Naïve approaches:
    • Split at 0 or at the median value
  • More expensive approaches:
    • Attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector)

Split at 0:
• Cluster A: positive points (0.33, 0.62, 0.31)
• Cluster B: negative points (-0.66, -0.35, -0.34)
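Both splitting strategies can be sketched on the six-node example. The code below is a hedged illustration: it assumes the common definition ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B), uses the example graph's edge list with 0-based node ids, and takes for `x2` the eigenvector for eigenvalue 1 rounded to one decimal. The function names are hypothetical.

```python
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
x2 = [0.3, 0.6, 0.3, -0.3, -0.3, -0.6]

def normalized_cut(A, edges, n):
    """cut(A, B)/vol(A) + cut(A, B)/vol(B), where B is the complement of A."""
    A = set(A)
    degree = [0] * n
    cut = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if (u in A) != (v in A):   # edge crosses the partition
            cut += 1
    vol_A = sum(degree[i] for i in A)
    vol_B = sum(degree) - vol_A
    return cut / vol_A + cut / vol_B

def split_at_zero(x):
    """Naive split: positive components on one side, the rest on the other."""
    return ([i for i, v in enumerate(x) if v > 0],
            [i for i, v in enumerate(x) if v <= 0])

def sweep_cut(x, edges):
    """Try every prefix of the sorted ordering and keep the best normalized cut."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    best = min(range(1, n), key=lambda k: normalized_cut(order[:k], edges, n))
    score = normalized_cut(order[:best], edges, n)
    return sorted(order[:best]), sorted(order[best:]), score

A, B = split_at_zero(x2)
low, high, score = sweep_cut(x2, edges)
print(A, B)
print(low, high, score)
```

On this small graph the two strategies agree; the sweep only needs n - 1 candidate cuts because the eigenvector fixes the node ordering.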
![Page 52: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/52.jpg)
[Figure: value of $x_2$ plotted against rank in $x_2$]
![Page 53: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/53.jpg)
[Figure: value of $x_2$ plotted against rank in $x_2$, alongside the components of $x_2$]
![Page 54: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/54.jpg)
[Figure: components of $x_1$ and components of $x_3$]
![Page 55: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/55.jpg)
• How do we partition a graph into k clusters?
• Two basic approaches:
  • Recursive bi-partitioning [Hagen et al., ’92]
    • Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner
    • Disadvantages: inefficient, unstable
  • Cluster multiple eigenvectors [Shi-Malik, ’00]
    • Build a reduced space from multiple eigenvectors
    • Commonly used in recent papers
    • A preferable approach…
![Page 56: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/56.jpg)
• Approximates the optimal cut [Shi-Malik, ’00]
  • Can be used to approximate the optimal k-way normalized cut
• Emphasizes cohesive clusters
  • Increases the unevenness in the distribution of the data
  • Associations between similar points are amplified, associations between dissimilar points are attenuated
  • The data begins to “approximate a clustering”
• Well-separated space
  • Transforms data to a new “embedded space”, consisting of k orthogonal basis vectors
• Multiple eigenvectors prevent instability due to information loss
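A rough sketch of clustering with multiple eigenvectors follows. Everything specific here is an assumption made for illustration: the toy graph (three triangles joined by two bridge edges), orthogonal iteration in place of a library eigensolver, and k-means with farthest-first initialization as the grouping step in the embedded space.

```python
import math

def laplacian(n, edges):
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1; L[v][v] += 1
        L[u][v] -= 1; L[v][u] -= 1
    return L

def smallest_eigenvectors(L, k, iters=300):
    """Orthonormal basis for the span of the k smallest eigenvectors of L."""
    n = len(L)
    shift = 2 * max(L[i][i] for i in range(n))
    # Deterministic, linearly independent start vectors: 1, i, i^2, ...
    Q = [[float((i + 1) ** j) for i in range(n)] for j in range(k)]
    for _ in range(iters):
        # Multiply every basis vector by (shift*I - L).
        Q = [[shift * q[i] - sum(L[i][j] * q[j] for j in range(n))
              for i in range(n)] for q in Q]
        # Gram-Schmidt re-orthonormalization.
        for a in range(k):
            for b in range(a):
                dot = sum(x * y for x, y in zip(Q[a], Q[b]))
                Q[a] = [x - dot * y for x, y in zip(Q[a], Q[b])]
            norm = math.sqrt(sum(x * x for x in Q[a]))
            Q[a] = [x / norm for x in Q[a]]
    return Q

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    # Farthest-first initialization: deterministic and spreads the centers out.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy graph: triangles {0,1,2}, {3,4,5}, {6,7,8} plus two bridges.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6)]
k = 3
Q = smallest_eigenvectors(laplacian(9, edges), k)
embedding = [[q[i] for q in Q] for i in range(9)]  # row i = coordinates of node i
labels = kmeans(embedding, k)
print(labels)
```

Each node becomes a point in a k-dimensional embedded space, and the grouping step operates on those points rather than on the graph itself.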
![Page 57: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/57.jpg)
![Page 58: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/58.jpg)
• Searching for small communities in the Web graph
• What is the signature of a community / discussion in a Web graph? [Kumar et al. ’99]
  • Dense 2-layer graph
  • Intuition: many people all talking about the same things
• Use this to define “topics”: what the same people on the left talk about on the right
• Remember HITS!
![Page 59: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/59.jpg)
• A more well-defined problem: enumerate complete bipartite subgraphs $K_{s,t}$
  • Where $K_{s,t}$: s nodes on the “left”, each of which links to the same t other nodes on the “right”

Example: $K_{3,4}$ with $|X| = s = 3$ and $|Y| = t = 4$; X and Y are fully connected.
![Page 60: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/60.jpg)
• Market basket analysis. Setting:
  • Market: universe U of n items
  • Baskets: m subsets of U: $S_1, S_2, \dots, S_m \subseteq U$ ($S_i$ is a set of items one person bought)
  • Support: frequency threshold f
• Goal:
  • Find all subsets T such that $T \subseteq S_i$ for at least f sets $S_i$ (items in T were bought together at least f times)
• What’s the connection between the itemsets and complete bipartite graphs?
[Agrawal-Srikant ’99]
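The goal stated above can be illustrated with a brute-force sketch. It is not the algorithm of the cited paper: it simply counts the support of every candidate itemset and stops at the first size with no frequent itemset, which is safe because a superset can never have higher support than its subsets. The baskets and the name `frequent_itemsets` are made up for this example.

```python
from itertools import combinations

def frequent_itemsets(baskets, f):
    """Return {itemset: support} for all non-empty itemsets in at least f baskets."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for T in combinations(items, size):
            # Support of T = number of baskets S_i with T a subset of S_i.
            support = sum(1 for S in baskets if set(T) <= S)
            if support >= f:
                result[frozenset(T)] = support
                found = True
        if not found:
            break  # monotonicity: no larger itemset can be frequent
    return result

# Hypothetical baskets.
baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"bread", "beer"}, {"milk", "beer"}]
freq = frequent_itemsets(baskets, f=2)
for itemset, support in freq.items():
    print(sorted(itemset), support)
```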
![Page 61: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/61.jpg)
Frequent itemsets = complete bipartite graphs!
• How?
  • View each node i as a set $S_i$ of nodes i points to
  • $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$
  • Looking for $K_{s,t}$: set the frequency threshold to s and look at layer t (all frequent sets of size t)
[Kumar et al. ’99]

Example: node i points to a, b, c and d, so $S_i$ = {a, b, c, d}.
[Figure: left set X = {i, j, k}, right set Y = {a, b, c, d}]
• s … minimum support (|X| = s)
• t … itemset size (|Y| = t)
![Page 62: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/62.jpg)
[Kumar et al. ’99]
• View each node i as a set $S_i$ of nodes i points to
  • Example: $S_i$ = {a, b, c, d}
• Find frequent itemsets:
  • s … minimum support
  • t … itemset size
• $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$
• Say we find a frequent itemset Y = {a, b, c} of support s
• So, there are s nodes that link to all of {a, b, c}:
  • x → {a, b, c}
  • y → {a, b, c}
  • z → {a, b, c}
• We found $K_{s,t}$! (X = {x, y, z}, Y = {a, b, c})
![Page 63: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/63.jpg)
Itemsets:
• a = {b, c, d}
• b = {d}
• c = {b, d, e, f}
• d = {e, f}
• e = {b, d}
• f = {}

• Support threshold s = 2
  • {b, d}: support 3
  • {e, f}: support 2
• And we just found 2 bipartite subgraphs:
  • X = {a, c, e}, Y = {b, d}
  • X = {c, d}, Y = {e, f}
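The example can be checked with a short sketch that treats each node's out-links as a basket and looks for itemsets of size t with support at least s. The brute-force enumeration and the name `complete_bipartite` are illustrative assumptions, not the trawling implementation from the cited paper.

```python
from itertools import combinations

# Out-links from the example: node -> set of nodes it points to.
out_links = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}

def complete_bipartite(out_links, s, t):
    """List (X, Y) with |Y| = t and X = all nodes linking to every node of Y, |X| >= s."""
    targets = sorted(set().union(*out_links.values()))
    found = []
    for Y in combinations(targets, t):
        # X collects every node whose out-link set contains all of Y.
        X = sorted(n for n, S in out_links.items() if set(Y) <= S)
        if len(X) >= s:
            found.append((X, list(Y)))
    return found

result = complete_bipartite(out_links, s=2, t=2)
for X, Y in result:
    print(X, "->", Y)
```

With s = 2 and t = 2 this reports the itemsets {b, d} and {e, f} together with the nodes that support them.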
![Page 64: Stanford University · 2014-08-11 · Split 1:2 1+1 paths to H Split evenly The algorithm: •Add edge flows:-- node flow = 1+ ∑child edges -- split the flow up based on the parent](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08dfe37e708231d4242490/html5/thumbnails/64.jpg)
• Example of a community from a web graph
[Figure: nodes on the right, nodes on the left]
[Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]