Algorithms for Big Data: Graphs and Memory Errors 4 (Lecture by Giuseppe Italiano)

Computing the Diameter of Large-Scale Graphs (based on work by Crescenzi, Grossi, Habib, Lanzi & Marino). ALMADA School, July – August 2013.

Upload: anton-konushin

Post on 11-May-2015


DESCRIPTION

The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits arbitrarily flip and corrupt the values of the affected memory cells. Such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

TRANSCRIPT

Page 1: Algorithms for Big Data: Graphs and Memory Errors 4 (Lecture by Giuseppe Italiano)

Computing the Diameter of Large-Scale Graphs

(Based on work by Crescenzi, Grossi, Habib, Lanzi & Marino)

ALMADA School, July – August 2013

ALMADA School Diameter of Large-Scale Graphs

Page 2:

Definition

(Un)weighted, (un)directed graph $G = (V, E)$ (with weights $w : E \to \mathbb{R}$), (strongly) connected.

The distance d(u, v) is the number of edges (or, in the weighted case, the sum of the edge weights) along a shortest path from u to v.

The diameter D of a graph is the length of the longest shortest path: $D = \max_{u, v \in V} d(u, v)$.

Page 3:

Definition

Forward eccentricity of u: in how many hops can u reach any node?

$\mathrm{eccF}(u) = \max_{v \in V} d(u, v)$

Backward eccentricity of u: in how many hops can u be reached from any node?

$\mathrm{eccB}(u) = \max_{v \in V} d(v, u)$

Diameter: maximum eccF or eccB.

[Figure: example directed graph on nodes v1, ..., v12]

Distances d(row, column), with the forward eccentricity of each row and the backward eccentricity of each column:

       v1  v2  v3  v4  v5  v6  v7  v8  v9 v10 v11 v12 | eccF
v1      0   2   1   3   1   3   2   3   2   4   1   2 |    4
v2      1   0   2   1   2   3   2   3   3   4   2   3 |    4
v3      2   1   0   2   3   2   1   2   4   3   3   4 |    4
v4      1   3   2   0   2   2   1   2   3   3   2   3 |    3
v5      3   2   1   2   0   3   2   2   1   3   4   5 |    5
v6      2   4   3   1   3   0   2   3   4   4   3   4 |    4
v7      3   4   3   2   2   1   0   1   3   2   4   5 |    5
v8      4   3   2   3   1   4   3   0   2   1   5   6 |    6
v9      2   4   3   1   2   3   2   1   0   2   3   4 |    4
v10     5   4   3   4   2   5   4   1   3   0   6   7 |    7
v11     2   4   3   5   3   5   4   5   4   6   0   1 |    6
v12     1   3   2   4   2   4   3   4   3   5   2   0 |    5
eccB    5   4   3   5   3   5   4   5   4   6   6   7
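The eccentricities in the table can be recomputed with one forward and one backward BFS per node, which is also the textbook way of getting the diameter. The following is a minimal sketch, not the authors' code: the edge list is an assumption (it takes every pair at distance 1 in the table as an arc, since the original figure is not available), and the helper names `adjacency` and `bfs_distances` are hypothetical.

```python
from collections import deque

# Assumed edge list: every pair (u, v) with d(u, v) = 1 in the table above.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def adjacency(edges, reverse=False):
    """Build successor lists, or predecessor lists when reverse=True."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, [])
        adj.setdefault(v, [])
        if reverse:
            adj[v].append(u)
        else:
            adj[u].append(v)
    return adj

def bfs_distances(adj, source):
    """Hop distances from source along the arcs of adj, in O(m) time."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

succ = adjacency(EDGES)                 # forward arcs, used by fbfs
pred = adjacency(EDGES, reverse=True)   # reversed arcs, used by bbfs

# Assumes a strongly connected graph, so every BFS reaches all nodes.
ecc_f = {u: max(bfs_distances(succ, u).values()) for u in succ}
ecc_b = {u: max(bfs_distances(pred, u).values()) for u in pred}
diameter = max(ecc_f.values())

print(ecc_f[1], ecc_b[1], diameter)  # 4 5 7 on the assumed edge list
```

On this edge list the computed values agree with the eccF column and the eccB row of the table.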

Page 4:

[Figure: the example graph on nodes v1, ..., v12]

i   $F^F_i(v_1)$         $F^B_i(v_1)$
1   v3, v5, v11          v2, v4, v12
2   v2, v7, v9, v12      v3, v6, v9, v11
3   v4, v6, v8           v5, v7
4   v10                  v8
5                        v10

Forward BFS tree [O(m) time]

For any i, the forward fringe $F^F_i(u)$ is the set of nodes at distance i from u.

Backward BFS tree [O(m) time]

For any i, the backward fringe $F^B_i(u)$ is the set of nodes from which u is at distance i.

[Figure: forward and backward BFS trees rooted at v1]
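The fringes are the levels of a BFS tree, so one traversal per direction yields all of them. The sketch below groups nodes by level; it is only an illustration, with an assumed edge list (the pairs at distance 1 in the example's distance table) and a hypothetical helper name `fringes`.

```python
from collections import deque

# Assumed edge list: the pairs at distance 1 in the example's distance table.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def fringes(edges, u, backward=False):
    """Map each level i >= 1 to the set of nodes at distance i from u
    (forward fringe) or at distance i to u (backward=True)."""
    adj = {}
    for a, b in edges:
        if backward:
            a, b = b, a            # traverse arcs in the reverse direction
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, [])
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    levels = {}
    for node, d in dist.items():
        if d > 0:
            levels.setdefault(d, set()).add(node)
    return levels

forward = fringes(EDGES, 1)
backward = fringes(EDGES, 1, backward=True)
for i in sorted(backward):
    print(i, sorted(forward.get(i, ())), sorted(backward[i]))
```

The height of each level map is the corresponding eccentricity of the root: 4 for the forward tree and 5 for the backward tree of v1.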

Page 5:

Known approaches

Textbook algorithm:

Perform n fbfs and return the maximum eccF. Each fbfs takes O(m) time, for a total of O(mn) time. Too expensive.

Several other approaches (see [Zwick, 2001]) solve all-pairs shortest paths. Still too expensive:

$O(n^{(3+\omega)/2} \log n)$, where $\omega$ is the exponent of matrix multiplication.

Empirically finding a lower bound L and an upper bound U.

That is, L ≤ D ≤ U. D is found when L = U.

Page 6:

Known approaches

In the case of undirected graphs:

Exact algorithm by Takes & Kosters [CIKM 2012]: up to 1M nodes, 100M edges.

Approximation algorithm by Ajwani, Meyer, Veith [ESA 2012]: $O(m\sqrt{n} + n^2)$ time to produce an estimate $\hat{D}$ such that $\lfloor 2D/3 \rfloor \le \hat{D} \le D$.

Roditty and Vassilevska Williams [STOC 2013]:

the same estimate in $O(m\sqrt{n})$ expected running time;

if for some $\epsilon > 0$ an algorithm for undirected unweighted graphs runs in $O(m^{2-\epsilon})$ time and produces an approximation $\hat{D}$ with $(2/3 + \epsilon) D \le \hat{D} \le D$, then SAT for CNF formulas on n variables can be solved in $O((2 - \delta)^n)$ time for some constant $\delta > 0$, and thus the widely believed strong exponential time hypothesis (SETH) of Impagliazzo, Paturi & Zane [JCSS'01] fails.

Page 7:

Easy lower and upper bounds

By using single-source (fbfs) and single-target (bbfs) shortest paths.

[Figure: the example graph with the forward and backward BFS trees of v1]

Lower bound: the maximum of the forward eccentricity eccF and the backward eccentricity eccB of a node. In the example, the lower bound is 5: at least one pair is at distance 5.

Upper bound: the forward eccentricity eccF (height of the fbfs tree) plus the backward eccentricity eccB (height of the bbfs tree) of a node. In the example, the upper bound is 9: every node can reach any other node by going to v1 in ≤ 5 steps and then to the destination in ≤ 4 steps.

If x satisfies d(x, u) = i (that is, $x \in F^B_i(u)$) and y satisfies d(u, y) = j (that is, $y \in F^F_j(u)$), then d(x, y) ≤ i + j, since i + j is the length of a path from x to y passing through u.

Very often L < D < U (as the experiments will show). In the example, the diameter is 7: d(v10, v12) = 7.
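Both bounds come out of a single fbfs and a single bbfs from the same node. The sketch below is a hedged illustration on an assumed edge list (the pairs at distance 1 in the example's distance table); `eccentricity` and `diameter_bounds` are hypothetical helper names.

```python
from collections import deque

# Assumed edge list: the pairs at distance 1 in the example's distance table.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def eccentricity(edges, u, backward=False):
    """Height of the forward (or backward) BFS tree rooted at u."""
    adj = {}
    for a, b in edges:
        if backward:
            a, b = b, a
        adj.setdefault(a, []).append(b)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, []):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

def diameter_bounds(edges, u):
    """Return (L, U) with L <= D <= U, assuming strong connectivity."""
    ecc_f = eccentricity(edges, u)
    ecc_b = eccentricity(edges, u, backward=True)
    # L: some pair really is this far apart.
    # U: any x reaches any y through u in at most ecc_b + ecc_f hops.
    return max(ecc_f, ecc_b), ecc_f + ecc_b

lower, upper = diameter_bounds(EDGES, 1)
print(lower, upper)  # 5 9 on the assumed edge list
```

Starting from v1 the gap between the bounds is 4, which illustrates why a single pair of visits is usually not enough to certify the diameter.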

Page 8:

Good lower bounds in undirected graphs

2-Sweep

1. Run a bfs from a random node r: let a be the farthest node.
2. Run a bfs from a: let b be the farthest node.
3. Return the length of the path from a to b.

[Figure: node r, its farthest node a, and the node b farthest from a]

Return d(a, b).
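The three steps above can be sketched in a few lines of Python. This is an illustrative version, not the authors' implementation: the graph is an adjacency dictionary, the helper names are hypothetical, and the small tree `TREE` is a made-up test input (on trees 2-Sweep is known to return the exact diameter).

```python
import random
from collections import deque

def bfs_farthest(adj, source):
    """Return (a farthest node from source, its distance).
    BFS dequeues nodes in nondecreasing distance order, so the last
    node taken from the queue is one of the farthest."""
    dist = {source: 0}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()
        for y in adj[last]:
            if y not in dist:
                dist[y] = dist[last] + 1
                queue.append(y)
    return last, dist[last]

def two_sweep(adj, r=None):
    """Lower bound on the diameter of a connected undirected graph."""
    if r is None:
        r = random.choice(list(adj))   # step 1: random starting node
    a, _ = bfs_farthest(adj, r)        # a is farthest from r
    _, d_ab = bfs_farthest(adj, a)     # step 2: b is farthest from a
    return d_ab                        # step 3: d(a, b)

# Hypothetical input: the path 0-1-2-3-4 with an extra leaf 5 attached to 2.
TREE = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(two_sweep(TREE, 5))  # 4, the distance between the leaves 0 and 4
```

The returned value is always the eccentricity of a real node, so it can never exceed the diameter; the only question is how tight it is.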

Page 9:

Experiments

Machine with:

Pentium Dual-Core CPU (Intel E5200 @ 2.50GHz),

8GB shared memory.

Running with:

OS Debian GNU/Linux 6.0,

Linux kernel version 2.6.32,

gcc version 4.4.5.

Code and the data set are available at http://piluc.dsi.unifi.it/lasagne/

Page 10:

Experiments: effectiveness of 2-Sweep

By starting from the highest degree node.

2-dSweepHdOut

Category                      # of networks   # of networks in which lb is tight   Maximum error
Protein-Protein Interaction   14              11                                   1
Collaboration                 14              12                                   1
Undirected Social             4               4                                    0
Undirected Communication      36              34                                   2
Autonomous System             2               1                                    1
Road                          3               1                                    14
Word Adjacency                7               4                                    1

Page 11:

Bad cases for 2-Sweep

[Figure: modified grid with nodes x1, ..., xp and y]

In this modified grid with k rows and 1 + 3k/2 columns, the algorithm returns k + 1. The diameter of the network is instead 3k/2.

Page 12:

Good lower bounds in directed graphs

2-dSweep

1. Run a forward bfs from a random node r: let a1 be the farthest node.
2. Run a backward bfs from a1: let b1 be the farthest node.
3. Run a backward bfs from r: let a2 be the farthest node.
4. Run a forward bfs from a2: let b2 be the farthest node.
5. If eccB(a1) > eccF(a2), then return the length of the path from b1 to a1. Otherwise return the length of the path from a2 to b2.

[Figure: r, a1, b1 for the first pair of visits and r, a2, b2 for the second pair]

Return the maximum between d(a2, b2) and d(b1, a1).

First used on directed graphs by Broder et al. in their study "Graph structure in the Web".
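A compact sketch of the five steps follows. It is an illustration under stated assumptions rather than the authors' code: the edge list is assumed (the pairs at distance 1 in the example's distance table from the definition slides), the graph is assumed strongly connected, and `farthest` and `two_dsweep` are hypothetical names.

```python
from collections import deque

# Assumed edge list: the pairs at distance 1 in the example's distance table.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def farthest(adj, source):
    """Return (a farthest node from source, its distance) along adj."""
    dist = {source: 0}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()          # nodes leave the queue by distance
        for y in adj.get(last, []):
            if y not in dist:
                dist[y] = dist[last] + 1
                queue.append(y)
    return last, dist[last]

def two_dsweep(edges, r):
    """Lower bound on the diameter of a strongly connected digraph."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    a1, _ = farthest(succ, r)            # step 1: forward bfs from r
    _, ecc_b_a1 = farthest(pred, a1)     # step 2: backward bfs from a1
    a2, _ = farthest(pred, r)            # step 3: backward bfs from r
    _, ecc_f_a2 = farthest(succ, a2)     # step 4: forward bfs from a2
    return max(ecc_b_a1, ecc_f_a2)       # step 5: the longer of the two paths

print(two_dsweep(EDGES, 1))  # 7 on the assumed edge list
```

Starting from v1, both sweeps reach v10, and the forward visit from v10 finds v12 at distance 7, which is the diameter of the example.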

Page 13:

Lower bound: experiments (snap.stanford.edu and webgraph.dsi.unimi.it dataset)

Network name          D     Runs (out of 10) in which LB = D was found   Worst LB
Wiki-Vote             9     10                                           9
p2p-Gnutella08        19    9                                            18
p2p-Gnutella09        19    9                                            18
p2p-Gnutella06        19    10                                           19
p2p-Gnutella05        22    9                                            21
p2p-Gnutella04        25    7                                            22
p2p-Gnutella25        21    8                                            20
p2p-Gnutella24        28    10                                           28
p2p-Gnutella30        23    2                                            22
p2p-Gnutella31        30    9                                            29
s.s.Slashdot081106    15    10                                           15
s.s.Slashdot090216    15    10                                           15
s.s.Slashdot090221    15    10                                           15
soc-Epinions1         16    9                                            15
Email-EuAll           10    10                                           10
soc-sign-epinions     16    10                                           16
web-NotreDame         93    10                                           93
Slashdot0811          12    10                                           12
Slashdot0902          13    3                                            12
WikiTalk              10    9                                            9
web-Stanford          210   10                                           210
web-BerkStan          679   10                                           679
web-Google            51    10                                           51

Network name          D     Runs (out of 10) in which LB = D was found   Worst LB
wordassociation-2011  10    9                                            9
enron                 10    10                                           10
uk-2007-05@100000     7     10                                           7
cnr-2000              81    10                                           81
uk-2007-05@1000000    40    10                                           40
in-2004               56    10                                           56
amazon-2008           47    10                                           47
eu-2005               82    10                                           82
indochina-2004        235   10                                           235
uk-2002               218   10                                           218
arabic-2005           133   10                                           133
uk-2005               166   10                                           166
it-2004               873   10                                           873

Page 14:

Experiments: effectiveness of 2-dSweep

By starting from the highest out-degree or the highest in-degree node.

                                         2-dSweepHdOut                  2-dSweepHdIn
Category                # of networks    lb is tight    Max error       lb is tight    Max error
Metabolic Bipartite     76               73             19              75             19
Metabolic Compound      76               73             9               75             9
Metabolic Reaction      76               73             10              75             10
Directed Social         10               10             0               10             0
Web                     16               16             0               16             0
Citation                2                2              0               2              0
Communication           3                3              0               2              1
P2P                     9                8              1               7              1
Product co-Purchasing   5                5              0               5              0
Word-association        1                1              0               1              0

("lb is tight" is the number of networks in which the lower bound is tight.)

Page 15:

Directed iterative fringe upper bound (difub)

Recall that

The trivial algorithm runs a fbfs from every node and returns the maximum eccF found (or a bbfs from every node and returns the maximum eccB found).

difub is just a special case in which we:

- specify the order in which the bfses have to be executed;
- refine a lower bound, that is, the maximum eccF or eccB found so far;
- upper bound the eccentricities of the remaining nodes;
- stop when the remaining nodes cannot have eccentricity higher than the current lower bound.

Page 16:

Finding a good order for bfses

1. Find a starting node u:
   - highest out-degree node
   - highest in-degree node
   - “central” node
2. Find a good order by analyzing how nodes are placed in the fbfs or bbfs tree of u.

Page 17:

Finding a “central” node

With heuristic 2-dSweep

1. Run a forward bfs from a random node r: let a1 be the farthest node.
2. Run a backward bfs from a1: let b1 be the farthest node.
3. Run a backward bfs from r: let a2 be the farthest node.
4. Run a forward bfs from a2: let b2 be the farthest node.
5. If eccB(a1) > eccF(a2), then set u as the middle node between a1 and b1 and the lower bound ℓ equal to eccB(a1). Otherwise, set u as the middle node between a2 and b2 and the lower bound ℓ = eccF(a2).

[Figure: r, a1, b1 for the first pair of visits and r, a2, b2 for the second pair]
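To pick the middle node, the two final visits also need to remember BFS parents so that the chosen path can be walked back. The sketch below is one possible reading of step 5, not the authors' code: the edge list is assumed (the pairs at distance 1 in the example's distance table), `bfs_tree` and `sweep_center` are hypothetical names, and "middle" is taken to mean the node reached after walking half of the path (rounded down) from the far endpoint.

```python
from collections import deque

# Assumed edge list: the pairs at distance 1 in the example's distance table.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def bfs_tree(adj, source):
    """Return (a farthest node, its distance, BFS parent map)."""
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()
        for y in adj.get(last, []):
            if y not in dist:
                dist[y] = dist[last] + 1
                parent[y] = last
                queue.append(y)
    return last, dist[last], parent

def sweep_center(edges, r):
    """Return (u, lower bound) as in step 5 of the heuristic above."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)
    a1, _, _ = bfs_tree(succ, r)
    b1, ecc_b_a1, par_b = bfs_tree(pred, a1)   # backward tree rooted at a1
    a2, _, _ = bfs_tree(pred, r)
    b2, ecc_f_a2, par_f = bfs_tree(succ, a2)   # forward tree rooted at a2
    if ecc_b_a1 > ecc_f_a2:
        end, length, parent = b1, ecc_b_a1, par_b
    else:
        end, length, parent = b2, ecc_f_a2, par_f
    node = end
    for _ in range(length // 2):               # walk half-way to the root
        node = parent[node]
    return node, length

u, lower = sweep_center(EDGES, 1)
print(u, lower)
```

On the assumed edge list and r = v1, the selected path runs from v10 to v12 and has length 7; which middle node is reported depends on how BFS breaks ties between equally short paths.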

Page 18:

Main idea for bounding eccentricities

Theorem

For any integer i with 1 < i ≤ eccB(u), for any integer k with 1 ≤ k < i, and for any node $x \in F^B_{i-k}(u)$ such that eccF(x) > 2(i − 1), there exists $y \in F^F_j(u)$, for some j ≥ i, such that d(x, y) = eccF(x).

[Figure: bbfs and fbfs trees of u, with x above level i in the bbfs tree and y at level j in the fbfs tree]

If the forward eccentricity of x is > 2(i − 1), the node y such that d(x, y) > 2(i − 1) lies at level i or below in the fbfs tree of u.

Analogously:

Theorem

For any integer i with 1 < i ≤ eccF(u), for any integer k with 1 ≤ k < i, and for any node $x \in F^F_{i-k}(u)$ such that eccB(x) > 2(i − 1), there exists $y \in F^B_j(u)$, for some j ≥ i, such that d(y, x) = eccB(x).

Page 19:

What the theorems say:

1. For each node x above level i in bbfs(u) with eccF(x) > 2(i − 1) there must be a corresponding node y on or below level i in fbfs(u), with eccB(y) ≥ eccF(x).
2. For each node t above level i in fbfs(u) with eccB(t) > 2(i − 1) there must be a corresponding node z on or below level i in bbfs(u), with eccF(z) ≥ eccB(t).

[Figure: bbfs and fbfs trees of u with level i marked and the nodes x, y, z, t]

The theorems above suggest the following algorithm:

1. Perform a forward and a backward bfs from a node u and visit the trees fbfs(u) and bbfs(u) bottom-up.
2. For each level i, compute the eccentricities of all nodes at level i (eccB for the nodes of fbfs(u), eccF for the nodes of bbfs(u)). At this point, we have all the eccB of nodes y and the eccF of nodes z on or below level i. Let the lower bound ℓi be the current maximum.
3. If ℓi is already bigger than 2(i − 1), then no node still to be examined can have eccF or eccB bigger than ℓi: stop and output ℓi as the diameter!
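The bottom-up procedure above can be sketched as follows. This is a minimal reading of the three steps, not the authors' released implementation: the edge list is assumed (the pairs at distance 1 in the example's distance table), the graph is assumed strongly connected, and the name `difub`, its `lower` parameter (an optional lower bound, for example from 2-dSweep) and the returned visit counter are illustrative choices.

```python
from collections import deque

# Assumed edge list: the pairs at distance 1 in the example's distance table.
EDGES = [(1, 3), (1, 5), (1, 11), (2, 1), (2, 4), (3, 2), (3, 7), (4, 1),
         (4, 7), (5, 3), (5, 9), (6, 4), (7, 6), (7, 8), (8, 5), (8, 10),
         (9, 4), (9, 8), (10, 8), (11, 12), (12, 1)]

def bfs(adj, source):
    """Hop distances from source along adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, []):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def difub(edges, u, lower=0):
    """Return (diameter, number of BFS visits) starting from node u."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)
    dist_f, dist_b = bfs(succ, u), bfs(pred, u)   # fbfs(u) and bbfs(u)
    visits = 2
    i = max(max(dist_f.values()), max(dist_b.values()))
    lb, ub = max(i, lower), 2 * i
    while ub > lb:
        for x, d in dist_f.items():               # level i of fbfs(u)
            if d == i:
                lb = max(lb, max(bfs(pred, x).values()))   # eccB(x)
                visits += 1
        for x, d in dist_b.items():               # level i of bbfs(u)
            if d == i:
                lb = max(lb, max(bfs(succ, x).values()))   # eccF(x)
                visits += 1
        if lb > 2 * (i - 1):
            return lb, visits     # no unexamined pair can be farther apart
        ub = 2 * (i - 1)          # pairs above level i meet through u
        i -= 1
    return lb, visits

print(difub(EDGES, 1))  # (7, 5) on the assumed edge list
```

From v1 the sketch stops after processing levels 5 and 4, using 5 visits instead of the 12 (or 24, counting both directions) of the textbook algorithm.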

Page 20:

Upper bound: experiments (snap.stanford.edu dataset)

Network name         n        m         Avg. visits   Visits worst run
Wiki-Vote            1300     39456     17            17
p2p-Gnutella08       2068     9313      45.9          64
p2p-Gnutella09       2624     10776     202.1         230
p2p-Gnutella06       3226     13589     236.6         279
p2p-Gnutella05       3234     13453     60.4          94
p2p-Gnutella04       4317     18742     36.7          38
p2p-Gnutella25       5153     17695     85.1          161
p2p-Gnutella24       6352     22928     13            13
p2p-Gnutella30       8490     31706     255.4         516
p2p-Gnutella31       14149    50916     208.7         255
s.s.Slashdot081106   26996    337351    22.3          25
s.s.Slashdot090216   27222    342747    21.5          26
s.s.Slashdot090221   27382    346652    22.8          26
soc-Epinions1        32223    443506    6.1           7
Email-EuAll          34203    151930    6             6
soc-sign-epinions    41441    693737    6             6
web-NotreDame        53968    304685    7             7
Slashdot0811         70355    888662    40            40
Slashdot0902         71307    912381    32.9          40
WikiTalk             111881   1477893   13.6          19
web-Stanford         150532   1576314   6             6
web-BerkStan         334857   4523232   7             7
web-Google           434818   3419124   9.4           10

Page 21:

Upper bound: experiments (webgraph.dsi.unimi.it dataset)

Network name           n          m           Avg. visits   Visits worst run
wordassociation-2011   4845       61567       412.5         423
enron                  8271       147353      19            22
uk-2007-05@100000      53856      1683102     14            14
cnr-2000               112023     1646332     17            17
uk-2007-05@1000000     480913     22057738    6             6
in-2004                593687     7827263     14            14
amazon-2008            627646     4706251     136.3         598
eu-2005                752725     17933415    6             6
indochina-2004         3806327    98815195    8             8
uk-2002                12090163   232137936   6             6
arabic-2005            15177163   473619298   58            58
uk-2005                25711307   704151756   170           170
it-2004                29855421   938694394   87            87

Page 22:

Experiments for directed graphs

[Plot: number of visits versus number of nodes, both on logarithmic scales, for diFUBHdOut, diFUBHdIn, diFUB+2dSweepHdOut and diFUB+2SweepHdIn]

Page 23:

Experiments for directed graphs

Performance gain increases (exponentially) with graph size.

For big graphs (≥ 10,000 nodes), ≤ 0.001n visits (instead of n).

Number of visits performed is asymptotically constant?

Page 24:

Experiments for undirected graphs

[Plot: number of visits versus number of nodes, both on logarithmic scales, for iFUBHd and iFUB+2SweepHd]

The undirected version of difub (called ifub) computed the diameter of Facebook with just 17 bfses.

Page 25:

Bad cases for difub and ifub

Cases in which nodes have close eccentricity.

1. A cycle

All nodes have eccentricity equal to the diameter.

D/2 + 1 iterations will always be executed.

Page 26:

Bad cases for difub and ifub

Cases in which nodes have close eccentricity.

2. Special regular graphs [such as Moore graphs]

All nodes have eccentricity equal to the diameter.

D/2 + 1 iterations will always be executed.

Page 27:

Future Work?

2-Sweep (both directed and undirected) seems effective in finding tight lower bounds (except for road networks). Why?

It is known that in chordal graphs the error of 2-Sweep can be at most 1. Do real-world networks (except for road networks) have some property that may be related to some sort of chordality measure?

Understand why the difub and ifub methods work so well in general.

Might it be related to the distribution of eccentricities?

Design faster (external memory / parallel) implementations of bfs.