CS 312: Graphs: BFS and DFS
Dan Sheldon
February 5, 2015
Plan
- Gale-Shapley Running Time
- Graphs
  - Motivation and definitions
  - Graph traversal: BFS and DFS
Running Time of Gale-Shapley?
Initially all colleges and students are free
while some college is free and hasn't made offers to every student do
    Choose such a college c
    Let s be the highest ranked student to whom c has not made an offer
    if s is free then
        c and s become engaged
    else if s is engaged to c' but prefers c to c' then
        c' becomes free
        c and s become engaged
    else
        c remains free
    end if
end while
O(n^2) iterations. Are all statements inside the loop constant time?
Data Structures
Running time depends on implementation details and data structures (e.g., how to “choose such a college c”).
- Q: How should we think about data structures when designing algorithms?
- A: Most of the time, as black boxes with running-time guarantees (e.g., “find an element in O(log n) time”).
Good news: don’t need to remember details of data structures
Bad news: they may seem opaque
Review: Lists and Arrays
Operation       Array                       List
Get ith entry   O(1)                        O(i)
Find element    O(n); O(log n) if sorted    O(n)
Insert/delete   O(n)                        O(1)
Note: O(1) = constant number of steps
What Data Structures to Use For G-S?
Need to do the following in O(1) time:
- Find free college c
- Find next student s in preference list of c
- Find current college c' of s
- Check if s likes c' better than c
What Data Structures to Use For G-S?
Input: preference lists = 2D arrays, e.g. CollegePref[c, i] = student in position i on c’s list
Operation                                     Data structure
Find a free college c                         linked list: freeColleges
Find next student s in preference list of c   array: i = Next[c], s = CollegePref[c, i]
Find current college c' of s                  1-D array: Current[s]
Check if s likes c' better than c             2-D array: Ranking[s, c]
example on board
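A minimal Python sketch of how the loop and these data structures might fit together. It assumes n colleges and n students numbered 0..n-1 with complete preference lists. The function name gale_shapley, the student_pref input, and the use of a deque in place of the linked list freeColleges are illustrative choices, not taken from the slides.

```python
from collections import deque

def gale_shapley(college_pref, student_pref):
    """Return current, where current[s] is the college matched to student s.

    college_pref[c][i] = student in position i on c's list (CollegePref)
    student_pref[s][i] = college in position i on s's list (assumed input)
    """
    n = len(college_pref)
    # ranking[s][c] = position of college c on s's list (Ranking on the slide)
    ranking = [[0] * n for _ in range(n)]
    for s in range(n):
        for pos, c in enumerate(student_pref[s]):
            ranking[s][c] = pos
    next_idx = [0] * n               # Next[c]: next position on c's list
    current = [None] * n             # Current[s]: college of s, or None if free
    free_colleges = deque(range(n))  # stands in for the linked list freeColleges

    while free_colleges:
        c = free_colleges.popleft()      # find a free college in O(1)
        if next_idx[c] == n:             # c has made offers to every student
            continue
        s = college_pref[c][next_idx[c]] # next student on c's list in O(1)
        next_idx[c] += 1
        if current[s] is None:
            current[s] = c               # s is free: c and s become engaged
        elif ranking[s][c] < ranking[s][current[s]]:
            free_colleges.append(current[s])  # old college becomes free
            current[s] = c
        else:
            free_colleges.append(c)      # c remains free
    return current
```

Each loop iteration does a constant number of array and deque operations, so with O(n^2) iterations the loop takes O(n^2) time; building ranking also takes O(n^2).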
Another Example: Heapsort
Input: unsorted array A[] of length n
Let Q be a heap-based priority queue
for i = 1 to n do
    Insert(Q, A[i])
end for
for i = 1 to n do
    A[i] = ExtractMin(Q)
end for
Running time: (n × Insert) + (n × ExtractMin)
- O(n log n) if both operations are O(log n)
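The same pseudocode can be sketched in Python using the standard library's heapq module as the black-box priority queue. The function name heapsort is an illustrative choice; heappush and heappop play the roles of Insert and ExtractMin.

```python
import heapq

def heapsort(a):
    """Sort list a in place by pushing everything onto a heap, then popping."""
    q = []
    for x in a:
        heapq.heappush(q, x)      # Insert(Q, A[i]): O(log n)
    for i in range(len(a)):
        a[i] = heapq.heappop(q)   # A[i] = ExtractMin(Q): O(log n)
    return a
```

The analysis only needs the O(log n) guarantee of each heap operation, not how the heap is implemented.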
Graphs
- Motivation and Definitions
- Breadth-First Search (BFS) and Depth-First Search (DFS)
Undirected Graph
Undirected graph. G = (V, E)
V = nodes (vertices)
E = edges between pairs of nodes.
Captures pairwise relationship between objects.
Graph size parameters: n = |V|, m = |E|.
Example:
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
n = 5
m = 6
[Figure: drawing of the example graph on nodes 1-5]
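One common way to store such a graph in code is an adjacency list. The sketch below builds one for the example graph. The representation and the name build_adjacency are assumptions for illustration, since the slides do not fix a representation at this point.

```python
def build_adjacency(nodes, edges):
    """Map each node to the list of its neighbors in an undirected graph."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)   # each undirected edge is stored in both directions
        adj[v].append(u)
    return adj

V = [1, 2, 3, 4, 5]
E = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)]
adj = build_adjacency(V, E)
n, m = len(V), len(E)
```

Every edge appears in two neighbor lists, so the lists have total length 2m.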
Graphs
- Google Maps: what is the shortest driving route from South Hadley to Florida?
- Facebook: how many “degrees of separation” between me and Barack Obama?
- And many more...
Four Degrees of Separation
Lars Backstrom*, Paolo Boldi†, Marco Rosa†, Johan Ugander*, Sebastiano Vigna†
January 6, 2012
Abstract
Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links.^1 Stanley Milgram in his famous experiment [20, 23] challenged people to route postcards to a fixed recipient by passing them only through direct acquaintances. The average number of intermediaries on the path of the postcards lay between 4.4 and 5.7, depending on the sample of people chosen.
We report the results of the first world-scale social-network graph-distance computations, using the entire Facebook network of active users (≈ 721 million users, ≈ 69 billion friendship links). The average distance we observe is 4.74, corresponding to 3.74 intermediaries or “degrees of separation”, showing that the world is even smaller than we expected, and prompting the title of this paper. More generally, we study the distance distribution of Facebook and of some interesting geographic subgraphs, looking also at their evolution over time.
The networks we are able to explore are almost two orders of magnitude larger than those analysed in the previous literature. We report detailed statistical metadata showing that our measurements (which rely on probabilistic algorithms) are very accurate.
1 Introduction
At the 20th World–Wide Web Conference, in Hyderabad, India, one of the authors (Sebastiano) presented a new tool for studying the distance distribution of very large graphs: HyperANF [3]. Building on previous graph compression [4] work and on the idea of diffusive computation pioneered in [21], the new tool made it possible to accurately study the distance distribution of graphs orders of magnitude larger than it was previously possible.
One of the goals in studying the distance distribution is the identification of interesting statistical parameters that can be used to tell proper social networks from other complex networks, such as web graphs. More generally, the distance distribution is one interesting global feature that makes it possible to reject probabilistic models even when they match local features such as the in-degree distribution.
In particular, earlier work had shown that the spid^2, which measures the dispersion of the distance distribution, appeared to be smaller than 1 (underdispersion) for social networks, but larger than one (overdispersion) for web graphs [3]. Hence, during the talk, one of the main open questions was “What is the spid of Facebook?”.
Lars Backstrom happened to listen to the talk, and suggested a collaboration studying the Facebook graph. This was of course an extremely intriguing possibility: beside testing the “spid hypothesis”, computing the distance distribution of the Facebook graph would have been the largest Milgram-like [20] experiment ever performed, orders of magnitudes larger than previous attempts (during our experiments Facebook has ≈ 721 million active users and ≈ 69 billion friendship links).
This paper reports our findings in studying the distance distribution of the largest electronic social network ever created. That world is smaller than we thought: the average distance of the current Facebook graph is 4.74. Moreover, the spid of the graph is just 0.09, corroborating the conjecture [3] that proper social networks have a spid well below one. We also observe, contrary to previous literature analysing graphs orders of magnitude smaller, both a stabilisation of the average distance over time, and that the density of the Facebook graph over time does not neatly fit previous models.
Towards a deeper understanding of the structure of the Facebook graph, we also apply recent compression techniques
* Facebook.
† DSI, Università degli Studi di Milano, Italy. Paolo Boldi, Marco Rosa and Sebastiano Vigna have been partially supported by a Yahoo! faculty grant and by MIUR PRIN “Query log e web crawling”.
^1 The exact wording of the story is slightly ambiguous: “He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual [...]”. It is not completely clear whether the selected individual is part of the five, so this could actually allude to distance five or six in the language of graph theory, but the “six degrees of separation” phrase stuck after John Guare’s 1990 eponymous play. Following Milgram’s definition and Guare’s interpretation (see further on), we will assume that “degrees of separation” is the same as “distance minus one”, where “distance” is the usual path length (the number of arcs in the path).
^2 The spid (shortest-paths index of dispersion) is the variance-to-mean ratio of the distance distribution.
arXiv:1111.4570v3 [cs.SI] 5 Jan 2012
Definitions
Terminology
If e = (u, v) is an edge, then:
(1) u is a neighbor of v
(2) u is adjacent to v
(3) e is incident on u and v
(4) u and v are the endpoints of e
[Figure: example graph on nodes 1-5]
1 is adjacent to 2
(1,2) is incident on 1 and 2
Path
A path is a sequence P of nodes v_1, v_2, ..., v_{k-1}, v_k with the property that each consecutive pair v_i, v_{i+1} is joined by an edge in E.
[Figure: example graph on nodes 1-5]
1-4-2 is a path.
1-3-4 is NOT a path.
Distance
The distance from u to v is the minimum number of edges in any path from u to v.
[Figure: example graph on nodes 1-5]
distance(1,2) = 1
distance(1,3) = 2
Cycle
A cycle is a path v_1, v_2, ..., v_{k-1}, v_k in which v_1 = v_k, k > 2, and the first k-1 nodes are all distinct.
1-2-4-1 is a cycle.
1-2-4 is NOT a cycle.
1-2-4-1-5 is NOT a cycle.
1-2-4-1-5-3-2-1 is NOT a cycle.
[Figure: example graph on nodes 1-5]
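The path and cycle definitions can be checked mechanically on the example graph with edges (1,2), (1,4), (1,5), (2,3), (2,4), (3,5). The sketch below applies the definitions literally as stated on the slides. The helper names is_path and is_cycle and the edge-set representation are illustrative assumptions.

```python
E = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)]
# Undirected edges: store each as an unordered pair so (1, 2) == (2, 1).
EDGES = {frozenset(e) for e in E}

def is_path(seq):
    """Each consecutive pair of nodes must be joined by an edge."""
    return all(frozenset((u, v)) in EDGES for u, v in zip(seq, seq[1:]))

def is_cycle(seq):
    """A path with v_1 = v_k, k > 2, and the first k-1 nodes all distinct."""
    k = len(seq)
    return (is_path(seq) and k > 2 and seq[0] == seq[-1]
            and len(set(seq[:-1])) == k - 1)
```

For instance, is_path([1, 4, 2]) holds while is_path([1, 3, 4]) does not, because (1,3) is not an edge; is_cycle([1, 2, 4, 1]) holds while the other three sequences listed on the cycle slide fail.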
Connectivity
An undirected graph is connected if for every pair of nodes u and v, there is a path between u and v.
[Figure: a graph on nodes 1-5] is a connected graph.
[Figure: a second graph on nodes 1-5] is NOT a connected graph.
Trees
A tree is an undirected graph that is connected and does not contain a cycle.
[Figure: a graph on nodes 1-5] is NOT a tree
[Figure: a second graph on nodes 1-5] is a tree
Trees
[Figure: three drawings of a tree on nodes 1-5]
http://www.offbeattravel.com/MoCA.html
(Upside-down) Parents, descendants, ancestors?
Review Definitions
What to know: n, m, neighbor, incident, path, distance, cycle, connected, tree
Example on board
Graph Traversal
Is a graph connected?
[Figure: example graphs, captioned “easy” and “hmmm...”]
Graph Traversal
Is a graph connected?
Approach: explore outward from arbitrary starting node s to find all nodes reachable from s (connected component)
Is a Graph Connected?
Algorithm 1: Breadth-first search (BFS)
Explore outward by distance
Start at a: [Figure: graph on nodes a-e]
Visit all nodes at distance 1 from a: [Figure: graph on nodes a-e]
Visit all nodes at distance 2 from a: [Figure: graph on nodes a-e]
Breadth-First Search
Layers
- L_0 = {s}
- L_1 = all neighbors of L_0
- L_2 = all nodes with an edge to L_1 that don’t belong to L_0 or L_1
- ...
- L_{i+1} = nodes with an edge to L_i that don’t belong to an earlier layer:
  L_{i+1} = {v : ∃(u, v) ∈ E, u ∈ L_i, v ∉ (L_0 ∪ ... ∪ L_i)}
Observation: L_i consists of all nodes at distance exactly i from s.
There is a path from s to t if and only if t appears in some layer.
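The layer definition translates almost directly into code. Below is a minimal sketch that builds the layers one at a time on the numbered example graph from the earlier slides, starting from node 1. The function name bfs_layers and the adjacency-dictionary input are illustrative assumptions.

```python
def bfs_layers(adj, s):
    """Return [L_0, L_1, ...] where L_i holds the nodes at distance i from s."""
    layers = [[s]]
    seen = {s}                    # nodes already placed in some layer
    while True:
        nxt = []
        for u in layers[-1]:      # u ranges over L_i
            for v in adj[u]:      # (u, v) is an edge
                if v not in seen: # v is not in L_0, ..., L_i
                    seen.add(v)
                    nxt.append(v)
        if not nxt:               # no new nodes: everything reachable is found
            return layers
        layers.append(nxt)

adj = {1: [2, 4, 5], 2: [1, 3, 4], 3: [2, 5], 4: [1, 2], 5: [1, 3]}
layers = bfs_layers(adj, 1)
```

Here node 2 lands in layer 1 and node 3 in layer 2, matching distance(1,2) = 1 and distance(1,3) = 2. The graph is connected exactly when the layers together contain all n nodes.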
BFS Tree
If we keep only the edges along which breadth-first search first discovers each node, we will have a tree
Example on board
BFS Tree
Property. Let T be a BFS tree of G = (V, E), and let (x, y) be an edge of G. Then the layers of x and y differ by at most 1.
[Figure: graph on nodes a-e]
Layer 0: {a}
Layer 1: {b, c, d}
Layer 2: {e}
Proof on board
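As a sanity check (not a proof), the property can be tested on a concrete graph. The sketch below runs a queue-based BFS that records each node's parent and layer, then measures the largest layer gap across any edge. It uses the numbered example graph from the earlier slides rather than the a-e graph, whose edges are not listed in the text; the names bfs_tree and max_gap are illustrative.

```python
from collections import deque

def bfs_tree(adj, s):
    """Return (parent, layer) dictionaries for a BFS from s."""
    parent = {s: None}
    layer = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:       # first discovery of v: (u, v) is a tree edge
                parent[v] = u
                layer[v] = layer[u] + 1
                queue.append(v)
    return parent, layer

adj = {1: [2, 4, 5], 2: [1, 3, 4], 3: [2, 5], 4: [1, 2], 5: [1, 3]}
parent, layer = bfs_tree(adj, 1)
# Largest difference in layer number across any edge of the graph.
max_gap = max(abs(layer[u] - layer[v]) for u in adj for v in adj[u])
```

On this graph max_gap is 1, and the parent pointers form a tree with n - 1 = 4 edges.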
A More General Strategy
To explore the connected component, add any node v for which
- (u, v) is an edge
- u is explored, but v is not
Picture on board
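A sketch of this general strategy, under the assumption that the graph is given as an adjacency dictionary. The names component and is_connected are illustrative. The list of candidate edges may be processed in any order; only the rule "u explored, v not" matters for correctness.

```python
def component(adj, s):
    """Return the set of nodes reachable from s."""
    explored = {s}
    # Candidate edges (u, v) whose endpoint u is already explored.
    candidates = [(s, v) for v in adj[s]]
    while candidates:
        u, v = candidates.pop()       # any choice of candidate edge works
        if v not in explored:         # u is explored, but v is not: add v
            explored.add(v)
            candidates.extend((v, w) for w in adj[v])
    return explored

def is_connected(adj):
    """A nonempty graph is connected if one component contains every node."""
    start = next(iter(adj))
    return len(component(adj, start)) == len(adj)
```

Taking the most recently added candidate, as pop() does here, explores in a depth-first flavor; taking the oldest candidate instead would explore outward layer by layer as in BFS.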
Is a Graph Connected?
Algorithm 2: Depth-first search (DFS)
Keep exploring from most recently added node until you have to backtrack
[Figures: five snapshots of DFS on the graph with nodes a-e]
DFS Algorithm
DFS(u)
    Mark u as "Explored"
    for each edge (u, v) incident to u do
        if v is not marked "Explored" then
            Recursively invoke DFS(v)
        end if
    end for
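A direct Python rendering of this pseudocode, run on the numbered example graph from the earlier slides. The explicit explored set and the order list (which records the sequence in which nodes are marked) are illustrative additions.

```python
def dfs(adj, u, explored, order):
    """Recursive DFS from u; explored and order are updated in place."""
    explored.add(u)                   # mark u as "Explored"
    order.append(u)
    for v in adj[u]:                  # each edge (u, v) incident to u
        if v not in explored:
            dfs(adj, v, explored, order)

adj = {1: [2, 4, 5], 2: [1, 3, 4], 3: [2, 5], 4: [1, 2], 5: [1, 3]}
explored, order = set(), []
dfs(adj, 1, explored, order)
```

With neighbors visited in the listed order, the search goes 1, 2, 3, 5, then backtracks to 2 to reach 4. On very deep graphs a recursive version like this can hit Python's recursion limit, so an explicit stack may be preferable there.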
Depth First Search
Theorem: Let T be a depth-first search tree. Let x and y be 2 nodes in the tree. Let (x, y) be an edge that is in G but not in T. Then either x is an ancestor of y or y is an ancestor of x in T.
Proof?
[Figure: graph on nodes a-e and its DFS tree]
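The theorem can be spot-checked (not proved) on a small graph. The sketch below builds a DFS tree as parent pointers on the numbered example graph from the earlier slides, collects the edges of G that are not tree edges, and tests the ancestor relation for each. The names dfs_tree, is_ancestor, non_tree, and ok are illustrative.

```python
def dfs_tree(adj, s):
    """Return parent pointers of the DFS tree rooted at s."""
    parent = {s: None}
    def visit(u):
        for v in adj[u]:
            if v not in parent:       # v unexplored: (u, v) becomes a tree edge
                parent[v] = u
                visit(v)
    visit(s)
    return parent

def is_ancestor(parent, a, b):
    """True if a lies on the tree path from b up to the root."""
    while b is not None:
        if b == a:
            return True
        b = parent[b]
    return False

adj = {1: [2, 4, 5], 2: [1, 3, 4], 3: [2, 5], 4: [1, 2], 5: [1, 3]}
parent = dfs_tree(adj, 1)
# Edges of G (listed once, with u < v) that are not edges of the DFS tree.
non_tree = [(u, v) for u in adj for v in adj[u]
            if u < v and parent.get(u) != v and parent.get(v) != u]
ok = all(is_ancestor(parent, u, v) or is_ancestor(parent, v, u)
         for u, v in non_tree)
```

Here the non-tree edges are (1,4) and (1,5), and node 1, the root, is an ancestor of both 4 and 5.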
Summary
Definitions
- G = (V, E), n = |V|, m = |E|
- neighbor, incident, cycle, path, connected
BFS and DFS
- Two ways to traverse a graph, each produces a tree
- BFS tree: shallow and wide (“bushy”)
- DFS tree: deep and narrow (“scraggly”)