

CS 312: Graphs: BFS and DFS

Dan Sheldon

February 5, 2015

Plan

- Gale-Shapley Running Time
- Graphs
  - Motivation and definitions
  - Graph traversal: BFS and DFS

Running Time of Gale-Shapley?

Initially all colleges and students are free
while some college is free and hasn’t made offers to every student do
    Choose such a college c
    Let s be the highest ranked student to whom c has not made an offer
    if s is free then
        c and s become engaged
    else if s is engaged to c' but prefers c to c' then
        c' becomes free
        c and s become engaged
    else
        c remains free
    end if
end while

O(n^2) iterations. Are all statements inside the loop constant time?

Data Structures

Running time depends on implementation details and data structures (e.g., how to “choose such a college c”).

- Q: How should we think about data structures when designing algorithms?

- A: Most of the time, as black boxes with running-time guarantees (e.g., “find an element in O(log n) time”).

Good news: don’t need to remember details of data structures

Bad news: they may seem opaque

Review: Lists and Arrays

Operation        Array                       List
Get ith entry    O(1)                        O(i)
Find element     O(n); O(log n) if sorted    O(n)
Insert/delete    O(n)                        O(1)

Note: O(1) = constant number of steps

What Data Structures to Use For G-S?

Need to do the following in O(1) time

- Find free college c
- Find next student s in preference list of c
- Find current college c' of s
- Check if s likes c' better than c


What Data Structures to Use For G-S?

Input: preference lists = 2D arrays, e.g. CollegePref[c, i] = student in position i on c’s list

Operation                                    Data structure
Find a free college c                        linked list: freeColleges
Find next student s in preference list of c  array: i = Next[c], s = CollegePref[c, i]
Find current college c' of s                 1-D array: Current[s]
Check if s likes c' better than c            2-D array: Ranking[s, c]

example on board
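As a rough sketch of how the table above could translate into code, the Python below runs college-proposing Gale-Shapley with each loop step taking O(1) time. The function name `gale_shapley`, the 0-based numbering, and the use of `collections.deque` in place of a linked list are assumptions made for illustration; the slides only name `freeColleges`, `Next`, `Current`, `Ranking`, and `CollegePref`.

```python
from collections import deque

def gale_shapley(college_pref, student_pref):
    """College-proposing Gale-Shapley (illustrative sketch).

    college_pref[c][i] = student in position i on college c's list.
    student_pref[s][i] = college in position i on student s's list.
    Colleges and students are numbered 0..n-1 (an assumption of this sketch).
    """
    n = len(college_pref)

    # Ranking[s][c] = position of college c on student s's list (lower is better).
    ranking = [[0] * n for _ in range(n)]
    for s in range(n):
        for pos, c in enumerate(student_pref[s]):
            ranking[s][c] = pos

    free_colleges = deque(range(n))  # stands in for the linked list freeColleges
    next_offer = [0] * n             # Next[c]: index of the next student c will try
    current = [None] * n             # Current[s]: college s is engaged to, or None

    while free_colleges:
        c = free_colleges[0]                 # find a free college in O(1)
        if next_offer[c] == n:               # c has made offers to every student
            free_colleges.popleft()
            continue
        s = college_pref[c][next_offer[c]]   # next student on c's list
        next_offer[c] += 1
        c_prime = current[s]
        if c_prime is None:                  # s is free
            current[s] = c
            free_colleges.popleft()
        elif ranking[s][c] < ranking[s][c_prime]:  # s prefers c to c'
            current[s] = c
            free_colleges.popleft()
            free_colleges.append(c_prime)    # c' becomes free
        # otherwise c remains free and stays at the front of the queue
    return current

matching = gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]])
```

Here `matching[s]` is the college assigned to student `s`. Building `ranking` up front costs O(n^2), which matches the O(n^2) bound on the number of loop iterations.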

Another Example: Heapsort

Input: unsorted array A[] of length n

Let Q be a heap-based priority queue
for i = 1 to n do
    Insert(Q, A[i])
end for
for i = 1 to n do
    A[i] = ExtractMin(Q)
end for

Running time: (n × Insert) + (n × ExtractMin)

- O(n log n) if both operations are O(log n)
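The same black-box idea can be sketched in Python, assuming the standard-library `heapq` module plays the role of the heap-based priority queue: `heappush` stands in for Insert and `heappop` for ExtractMin. The function name `heapsort` is chosen here for illustration.

```python
import heapq

def heapsort(a):
    """Sort the list a in place using a heap-based priority queue."""
    q = []
    for x in a:                  # n Insert operations, O(log n) each
        heapq.heappush(q, x)
    for i in range(len(a)):      # n ExtractMin operations, O(log n) each
        a[i] = heapq.heappop(q)
    return a

result = heapsort([5, 2, 9, 1, 5, 6])
```

The caller never needs to know how the heap is laid out, only that both operations run in O(log n) time.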

Graphs

- Motivation and Definitions
- Breadth-First Search (BFS) and Depth-First Search (DFS)

Undirected Graph

Undirected graph. G = (V, E)
- V = nodes (vertices)
- E = edges between pairs of nodes.
- Captures pairwise relationship between objects.
- Graph size parameters: n = |V|, m = |E|.

Example:
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
n = 5
m = 6

[Figure: drawing of the example graph on nodes 1 through 5]
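One possible way to hold the example graph in code is an adjacency structure that maps each node to the set of its neighbors. The slides do not prescribe a representation, so the helper name `build_adjacency` and the dictionary-of-sets layout are assumptions of this sketch.

```python
V = [1, 2, 3, 4, 5]
E = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)]

def build_adjacency(nodes, edges):
    """Return {node: set of neighbors} for an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)   # undirected: record the edge in both directions
    return adj

adj = build_adjacency(V, E)
n, m = len(V), len(E)   # graph size parameters n = |V|, m = |E|
```

With this layout, listing the neighbors of a node takes time proportional to the number of neighbors, which is what the traversal algorithms later in the lecture rely on.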

Graphs

- Google Maps: what is the shortest driving route from South Hadley to Florida?

- Facebook: how many “degrees of separation” between me and Barack Obama?

I And many more. . .

Four Degrees of Separation

Lars Backstrom*, Paolo Boldi†, Marco Rosa†, Johan Ugander*, Sebastiano Vigna†

January 6, 2012

Abstract

Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links.¹ Stanley Milgram in his famous experiment [20, 23] challenged people to route postcards to a fixed recipient by passing them only through direct acquaintances. The average number of intermediaries on the path of the postcards lay between 4.4 and 5.7, depending on the sample of people chosen.

We report the results of the first world-scale social-network graph-distance computations, using the entire Facebook network of active users (≈ 721 million users, ≈ 69 billion friendship links). The average distance we observe is 4.74, corresponding to 3.74 intermediaries or “degrees of separation”, showing that the world is even smaller than we expected, and prompting the title of this paper. More generally, we study the distance distribution of Facebook and of some interesting geographic subgraphs, looking also at their evolution over time.

The networks we are able to explore are almost two orders of magnitude larger than those analysed in the previous literature. We report detailed statistical metadata showing that our measurements (which rely on probabilistic algorithms) are very accurate.

1 Introduction

At the 20th World–Wide Web Conference, in Hyderabad, India, one of the authors (Sebastiano) presented a new tool for studying the distance distribution of very large graphs: HyperANF [3]. Building on previous graph compression [4] work and on the idea of diffusive computation pioneered in [21], the new tool made it possible to accurately study the distance distribution of graphs orders of magnitude larger than it was previously possible.

One of the goals in studying the distance distribution is the identification of interesting statistical parameters that can be used to tell proper social networks from other complex networks, such as web graphs. More generally, the distance distribution is one interesting global feature that makes it possible to reject probabilistic models even when they match local features such as the in-degree distribution.

In particular, earlier work had shown that the spid², which measures the dispersion of the distance distribution, appeared to be smaller than 1 (underdispersion) for social networks, but larger than one (overdispersion) for web graphs [3]. Hence, during the talk, one of the main open questions was “What is the spid of Facebook?”.

Lars Backstrom happened to listen to the talk, and suggested a collaboration studying the Facebook graph. This was of course an extremely intriguing possibility: beside testing the “spid hypothesis”, computing the distance distribution of the Facebook graph would have been the largest Milgram-like [20] experiment ever performed, orders of magnitudes larger than previous attempts (during our experiments Facebook has ≈ 721 million active users and ≈ 69 billion friendship links).

This paper reports our findings in studying the distance distribution of the largest electronic social network ever created. That world is smaller than we thought: the average distance of the current Facebook graph is 4.74. Moreover, the spid of the graph is just 0.09, corroborating the conjecture [3] that proper social networks have a spid well below one. We also observe, contrary to previous literature analysing graphs orders of magnitude smaller, both a stabilisation of the average distance over time, and that the density of the Facebook graph over time does not neatly fit previous models.

Towards a deeper understanding of the structure of the Facebook graph, we also apply recent compression techniques

* Facebook.

† DSI, Università degli Studi di Milano, Italy. Paolo Boldi, Marco Rosa and Sebastiano Vigna have been partially supported by a Yahoo! faculty grant and by MIUR PRIN “Query log e web crawling”.

¹ The exact wording of the story is slightly ambiguous: “He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual [. . . ]”. It is not completely clear whether the selected individual is part of the five, so this could actually allude to distance five or six in the language of graph theory, but the “six degrees of separation” phrase stuck after John Guare’s 1990 eponymous play. Following Milgram’s definition and Guare’s interpretation (see further on), we will assume that “degrees of separation” is the same as “distance minus one”, where “distance” is the usual path length (the number of arcs in the path).

² The spid (shortest-paths index of dispersion) is the variance-to-mean ratio of the distance distribution.

arXiv:1111.4570v3 [cs.SI] 5 Jan 2012



Definitions

Terminology

If e = (u, v) is an edge, then:
(1) u is a neighbor of v
(2) u is adjacent to v
(3) e is incident on u and v
(4) u and v are the endpoints of e

1 is adjacent to 2
(1,2) is incident on 1 and 2

[Figure: the example graph on nodes 1 through 5]

Path

A path is a sequence P of nodes v_1, v_2, ..., v_{k-1}, v_k with the property that each consecutive pair v_i, v_{i+1} is joined by an edge in E.

1-4-2 is a path.
1-3-4 is NOT a path.

[Figure: the example graph on nodes 1 through 5]
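The path definition can be checked mechanically. The sketch below assumes the example edge set E from the Undirected Graph slide and uses two helper names, `has_edge` and `is_path`, that are introduced here only for illustration.

```python
# Edge set of the example graph; each undirected edge is stored once.
E = {(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)}

def has_edge(u, v, edges=E):
    """True if u and v are joined by an undirected edge."""
    return (u, v) in edges or (v, u) in edges

def is_path(nodes, edges=E):
    """True if every consecutive pair in the sequence is joined by an edge."""
    return all(has_edge(u, v, edges) for u, v in zip(nodes, nodes[1:]))

ok = is_path([1, 4, 2])        # edges (1,4) and (2,4) both exist
not_ok = is_path([1, 3, 4])    # there is no edge between 1 and 3
```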

Distance

The distance from u to v is the minimum number of edges in any path from u to v

[Figure: the example graph on nodes 1 through 5]

distance(1,2) = 1
distance(1,3) = 2

Cycle

A cycle is a path v_1, v_2, ..., v_{k-1}, v_k in which v_1 = v_k, k > 2, and the first k-1 nodes are all distinct.

1-2-4-1 is a cycle.
1-2-4 is NOT a cycle.
1-2-4-1-5 is NOT a cycle.
1-2-4-1-5-3-2-1 is NOT a cycle.

[Figure: the example graph on nodes 1 through 5]
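The three conditions in the cycle definition can be tested one by one. This sketch again assumes the example edge set E from the Undirected Graph slide; the names `is_path` and `is_cycle` are illustrative, not from the slides.

```python
E = {(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)}

def is_path(nodes, edges=E):
    """True if every consecutive pair in the sequence is joined by an edge."""
    return all((u, v) in edges or (v, u) in edges
               for u, v in zip(nodes, nodes[1:]))

def is_cycle(nodes, edges=E):
    """Literal check of the definition: v_1 = v_k, k > 2, first k-1 nodes distinct."""
    k = len(nodes)
    return (k > 2
            and nodes[0] == nodes[-1]
            and len(set(nodes[:-1])) == k - 1
            and is_path(nodes, edges))

results = [is_cycle(p) for p in ([1, 2, 4, 1],
                                 [1, 2, 4],
                                 [1, 2, 4, 1, 5],
                                 [1, 2, 4, 1, 5, 3, 2, 1])]
```

The four sequences are the ones listed on the slide, and only the first passes all three conditions.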

Connectivity

An undirected graph is connected if for every pair of nodes u and v, there is a path between u and v.

[Figure: a graph on nodes 1 through 5 that is a connected graph.]

[Figure: a graph on nodes 1 through 5 that is NOT a connected graph.]


Trees

A tree is an undirected graph that is connected and does not contain a cycle.

[Figure: a graph on nodes 1 through 5 that is NOT a tree]

[Figure: a graph on nodes 1 through 5 that is a tree]

Trees

[Figure: drawings of trees on nodes 1 through 5. Image source: http://www.offbeattravel.com/MoCA.html]

(Upside-down) Parents, descendants, ancestors?

Review Definitions

What to know: n, m, neighbor, incident, path, distance, cycle, connected, tree

Example on board

Graph Traversal

Is a graph connected?

[Figure: example graphs, captioned “easy” and “hmmm...”]

Graph Traversal

Is a graph connected?

Approach: explore outward from arbitrary starting node s to find all nodes reachable from s (connected component)

Is a Graph Connected?

Algorithm 1: Breadth-first search (BFS)

Explore outward by distance

[Figure: the same graph on nodes a through e, drawn three times with the captions “Start at a”, “Visit all nodes at distance 1 from a”, and “Visit all nodes at distance 2 from a”]


Breadth-First Search

Layers

- L_0 = {s}
- L_1 = all neighbors of L_0
- L_2 = all nodes with an edge to L_1 that don’t belong to L_0 or L_1
- ...
- L_{i+1} = nodes with an edge to L_i that don’t belong to an earlier layer:

$$L_{i+1} = \{\, v : \exists (u, v) \in E,\ u \in L_i,\ v \notin (L_0 \cup \dots \cup L_i) \,\}$$

Observation: L_i consists of all nodes at distance exactly i from s.

There is a path from s to t if and only if t appears in some layer.
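The layer definition can be turned into code almost line by line. The sketch below is one way to do it, assuming the numeric example graph from the Undirected Graph slide stored as a dictionary of neighbor sets; the name `bfs_layers` and the sorted neighbor order are choices made here for a reproducible result.

```python
# Example graph: E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
adj = {1: {2, 4, 5}, 2: {1, 3, 4}, 3: {2, 5}, 4: {1, 2}, 5: {1, 3}}

def bfs_layers(adj, s):
    """Return [L_0, L_1, ...] where L_i holds the nodes at distance i from s."""
    layers = [[s]]
    seen = {s}                       # union of all layers built so far
    while layers[-1]:
        next_layer = []
        for u in layers[-1]:
            for v in sorted(adj[u]):
                if v not in seen:    # v has an edge to L_i and is in no earlier layer
                    seen.add(v)
                    next_layer.append(v)
        layers.append(next_layer)
    layers.pop()                     # drop the empty layer that ended the loop
    return layers

layers = bfs_layers(adj, 1)
```

For the start node 1 this gives the layers {1}, {2, 4, 5}, {3}, which agrees with distance(1,2) = 1 and distance(1,3) = 2 from the Distance slide.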

BFS Tree

If we keep only the edges traversed while doing a breadth-first search, we will have a tree.

Example on board

BFS Tree

Property. Let T be a BFS tree of G = (V, E), and let (x, y) be an edge of G. Then the layers of x and y differ by at most 1.

[Figure: example graph on nodes a through e]

Layer 0: {a}
Layer 1: {b, c, d}
Layer 2: {e}

Proof on board
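The property can also be checked empirically on a small graph. This is only a sanity check on one example, not a proof. It assumes the numeric example graph from the Undirected Graph slide; `bfs_tree`, `layer`, and `parent` are names introduced for this sketch.

```python
from collections import deque

# Example graph: E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
adj = {1: {2, 4, 5}, 2: {1, 3, 4}, 3: {2, 5}, 4: {1, 2}, 5: {1, 3}}

def bfs_tree(adj, s):
    """Return (layer, parent): the BFS layer of each node and its tree parent."""
    layer = {s: 0}
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in layer:          # first time v is reached: (u, v) is a tree edge
                layer[v] = layer[u] + 1
                parent[v] = u
                queue.append(v)
    return layer, parent

layer, parent = bfs_tree(adj, 1)
edges = [(u, v) for u in adj for v in adj[u] if u < v]
max_gap = max(abs(layer[u] - layer[v]) for u, v in edges)
```

On this graph `max_gap` is 1, and the kept edges (one per non-root node) form a tree with n - 1 = 4 edges.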

A More General Strategy

To explore the connected component, add any node v for which

- (u, v) is an edge
- u is explored, but v is not

Picture on board
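A minimal sketch of this general strategy, and of using it to answer “is a graph connected?”, is shown below. The order in which explored nodes are picked is left arbitrary on the slide; this sketch happens to use a Python list as a stack. The names `component` and `is_connected` are assumptions for illustration.

```python
def component(adj, s):
    """Return the set of nodes reachable from s (the connected component of s)."""
    explored = {s}
    frontier = [s]                 # explored nodes whose edges still need scanning
    while frontier:
        u = frontier.pop()         # any explored node would do here
        for v in adj[u]:
            if v not in explored:  # (u, v) is an edge, u is explored, v is not
                explored.add(v)
                frontier.append(v)
    return explored

def is_connected(adj):
    """A graph is connected if one component contains every node."""
    if not adj:
        return True
    start = next(iter(adj))
    return len(component(adj, start)) == len(adj)

# Example graph from the Undirected Graph slide, plus a small disconnected one.
g1 = {1: {2, 4, 5}, 2: {1, 3, 4}, 3: {2, 5}, 4: {1, 2}, 5: {1, 3}}
g2 = {1: {2}, 2: {1}, 3: set()}
answers = (is_connected(g1), is_connected(g2))
```

Swapping the stack for a queue would give breadth-first order; the set of nodes found is the same either way.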

Is a Graph Connected?

Algorithm 2: Depth-first search (DFS) - Keep exploring from most recently added node until you have to backtrack

[Figure: the same graph on nodes a through e, drawn at successive stages of the search]

DFS Algorithm

DFS(u)
    Mark u as "Explored"
    for each edge (u, v) incident to u do
        if v is not marked "Explored" then
            Recursively invoke DFS(v)
        end if
    end for
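The pseudocode maps directly onto a recursive Python function. The sketch below assumes the numeric example graph from the Undirected Graph slide; the `order` list and the sorted neighbor order are additions made here so the visit order is reproducible.

```python
# Example graph: E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
adj = {1: {2, 4, 5}, 2: {1, 3, 4}, 3: {2, 5}, 4: {1, 2}, 5: {1, 3}}

def dfs(adj, u, explored, order):
    """Recursive DFS from u; appends nodes to order as they are marked explored."""
    explored.add(u)                  # Mark u as "Explored"
    order.append(u)
    for v in sorted(adj[u]):         # for each edge (u, v) incident to u
        if v not in explored:
            dfs(adj, v, explored, order)

explored, order = set(), []
dfs(adj, 1, explored, order)
```

Starting from node 1, the search goes 1, 2, 3, 5 before it has to backtrack to 2 and pick up 4. For very deep graphs, Python's recursion limit would call for an explicit stack instead.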


Depth First Search

Theorem: Let T be a depth-first search tree. Let x and y be 2 nodes in the tree. Let (x, y) be an edge that is in G but not in T. Then either x is an ancestor of y or y is an ancestor of x in T.

Proof?

[Figure: two drawings on nodes a through e]
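Before attempting the proof, the theorem can be sanity-checked on a small case. The sketch below builds a DFS tree of the numeric example graph from the Undirected Graph slide and tests every non-tree edge; it illustrates the statement on one graph and proves nothing. The names `dfs_tree`, `is_ancestor`, and `non_tree` are assumptions of this sketch.

```python
# Example graph: E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
adj = {1: {2, 4, 5}, 2: {1, 3, 4}, 3: {2, 5}, 4: {1, 2}, 5: {1, 3}}

def dfs_tree(adj, root):
    """Return parent pointers of the DFS tree rooted at root."""
    parent = {root: None}
    def visit(u):
        for v in sorted(adj[u]):
            if v not in parent:      # v is unexplored: (u, v) becomes a tree edge
                parent[v] = u
                visit(v)
    visit(root)
    return parent

def is_ancestor(parent, a, b):
    """True if a lies on the tree path from b up to the root."""
    while b is not None:
        if b == a:
            return True
        b = parent[b]
    return False

parent = dfs_tree(adj, 1)
tree_edges = {frozenset((v, p)) for v, p in parent.items() if p is not None}
non_tree = sorted((u, v) for u in adj for v in adj[u]
                  if u < v and frozenset((u, v)) not in tree_edges)
holds = all(is_ancestor(parent, u, v) or is_ancestor(parent, v, u)
            for u, v in non_tree)
```

Here the non-tree edges are (1,4) and (1,5), and node 1 is the root, so it is an ancestor of both 4 and 5.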

Summary

Definitions

- G = (V, E), n = |V|, m = |E|
- neighbor, incident, cycle, path, connected

BFS and DFS

- Two ways to traverse a graph, each produces a tree
- BFS tree: shallow and wide (“bushy”)
- DFS tree: deep and narrow (“scraggly”)