Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)



DESCRIPTION

The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. The appearance of such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

TRANSCRIPT

Page 1: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)

Original Plan

1.  Algorithms for BIG graphs

•  The centrality of centrality

•  How to store BIG Graphs (WebGraph Framework)

•  Four Degrees of Separation

•  Diameter and Radius

2.  Big Data and Memory Errors

Page 2

Slightly Revised Plan

1.  Algorithms for BIG graphs

•  The centrality of centrality

•  Four Degrees of Separation

•  Diameter (no Radius)

•  How to store BIG Graphs (WebGraph Framework)

2.  Big Data and Memory Errors

Page 3

Four Degrees of Separation

Page 4

Literature

•  Frigyes Karinthy, in his 1929 short story "Láncszemek" ("Chains"), suggested that any two persons are separated by at most six friendship links

•  Just an (optimistic) positivistic statement about combinatorial explosion

•  Used by John Guare in his 1990 eponymous play (and the 1993 movie by Fred Schepisi)

Page 5

The Sociologists

•  M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s)

•  A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav. Sci. 1961)

•  S. Milgram: An experimental study of the small world problem. (Sociometry, 1969)…

Page 6

Milgram's question

•  "Given two individuals selected randomly from the population, what is the probability that the minimum number of intermediaries required to link them is 0, 1, 2, . . . , k?"

Page 7

Milgram's question

•  What is the distance distribution of the acquaintance graph?
   –  how many pairs are friends?
   –  how many are not friends but have a friend in common?
   –  …

•  Note on "distance":
   –  sociologists measure the degrees of separation
   –  as computer scientists we measure the graph-theoretic distance (just add one)

Page 8

Milgram's experiment

296 people (starting population) asked to dispatch a parcel to a single individual (target)

•  Target: a Boston stockbroker

•  Starting population selected as follows:
   –  100 were random Boston inhabitants (group A)
   –  100 were random Nebraska stockholders (group B)
   –  96 were random Nebraska inhabitants (group C)

•  Rule of the game: parcels could be directly sent only to someone the sender knows personally ("first-name acquaintance")

Page 9

Milgram's experiment (figure)

Page 10

Milgram's experiment

•  Actually completed 64 chains (22%)

•  453 intermediaries happened to be involved in the experiments

•  Average length of the completed chains was 6.2, ranging from 5.4 (Boston group) to 6.7 (random Nebraska inhabitants)

•  A distance of 6.7 corresponds to 5.7 degrees of separation (thus six degrees of separation)

Page 11

A new Milgram experiment?

•  Can we reproduce Milgram's experiment on a large scale?

•  How can one compute or approximate the distance distribution of a given huge friendship graph (such as Facebook)?

Page 12

Graph distances and distribution

•  The distance d(x,y) is the length of the shortest path from x to y
   –  d(x,y) = ∞ if one cannot go from x to y

•  For undirected graphs d(x,y) = d(y,x)

•  For every t, count the number of pairs (x,y) such that d(x,y) = t (distance distribution)

•  The fraction of pairs at distance t is (the density function of) a distribution
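The distribution just defined can be computed exactly on a small graph with one breadth-first visit per node. A minimal sketch (the adjacency-dict representation and function name are illustrative; this is exactly the O(mn) baseline that the rest of the lecture tries to avoid):

```python
from collections import deque, Counter

def distance_distribution(adj):
    """For every t, count the ordered pairs (x, y) with d(x, y) = t,
    by running one BFS from every node."""
    dist_counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for node, d in dist.items():
            if node != source:
                dist_counts[d] += 1
    return dist_counts

# A path on four nodes: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(distance_distribution(adj))  # Counter({1: 6, 2: 4, 3: 2})
```

Unreachable pairs (distance ∞) simply never appear in the counter.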

Page 13

Previous experiments

On-line social networks:

•  6.6 degrees of separation on a one-month MSN Messenger communication graph (Leskovec and Horvitz [LH08])
   –  180 M nodes and 1.3 G arcs

•  3.67 degrees of separation in Twitter [KLPM10]
   –  5 G follows
   –  Is this meaningful? In Twitter, links are created without permission at both ends…

Page 14

•  4.74 degrees of separation in Facebook (Backstrom et al. [Backstrom+12])
   –  721 M people and 69 G friendship links

Page 15

HyperANF

•  New tool for studying/computing the distance distribution of very large graphs

•  Diffusion-based approximation algorithm

•  Based on WebGraph (Boldi et al. [BV04]) and ANF [Palmer et al., 2002]

•  Uses HyperLogLog counters [Flajolet et al., 2007] and broadword programming for low-level parallelization

Page 16

Computing the Neighborhood Function

•  Neighborhood function N(t): for each t, the number of pairs at distance ≤ t
   –  provides data about how fast the "average ball" around each node expands

•  Many breadth-first visits: O(mn), needs direct access ☹

•  Sampling: a fraction of breadth-first visits; very unreliable results on graphs that are not strongly connected, needs direct access ☹

•  Edith Cohen's [JCSS 1997] size estimation framework: very powerful (strong theoretical guarantees) but does not scale really well, needs direct access ☹

Page 17

Alternative: Diffusion

•  Basic idea: Palmer et al. [PGF02]

•  Let B_t(x) be the ball of radius t around x (nodes at distance at most t from x)

•  Clearly B_0(x) = {x}

•  But also B_{t+1}(x) = ∪_{x→y} B_t(y) ∪ {x}

•  So we can compute balls by enumerating the arcs x→y and performing unions (on sets)

•  The neighborhood function at t is given by the sum of the sizes of the balls of radius t, i.e., |B_t(x_1)| + |B_t(x_2)| + … + |B_t(x_n)|
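Before any approximation enters the picture, the diffusion scheme above can be sketched with exact sets (graph representation and names are illustrative):

```python
def neighborhood_function(nodes, arcs, max_t):
    """Exact diffusion: B_0(x) = {x}; B_{t+1}(x) = {x} ∪ (union of B_t(y)
    over arcs x→y).  N(t) is the sum of ball sizes, i.e. the number of
    pairs at distance ≤ t (including the trivial pairs (x, x))."""
    balls = {x: {x} for x in nodes}
    result = [sum(len(b) for b in balls.values())]
    for _ in range(max_t):
        # synchronous round: new balls are built from the previous ones
        new_balls = {x: set(balls[x]) for x in nodes}
        for x, y in arcs:          # enumerate arcs, perform unions
            new_balls[x] |= balls[y]
        balls = new_balls
        result.append(sum(len(b) for b in balls.values()))
    return result

# Directed triangle 0→1→2→0
print(neighborhood_function([0, 1, 2], [(0, 1), (1, 2), (2, 0)], 2))
# [3, 6, 9]
```

The next slide explains why keeping the sets explicitly is too costly, and what to replace them with.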

Page 18

Easy but Costly: Approximate

•  Every set needs O(n) bits: overall O(n²). Too many.

•  All we need is cardinality estimators: estimate the number of distinct elements (nodes) in very large multisets (balls around nodes), presented as massive streams

•  Do not estimate cardinality exactly: use probabilistic counting to get approximate estimates

•  Wish to choose approximate sets such that unions can be computed quickly

Page 19

ANF and HyperANF

ANF [PGF02]:
•  Diffusion with Flajolet–Martin (MF) counters (log n + c space)
•  MF counters can be combined with OR

HyperANF [BRV11a]:
•  Diffusion with HyperLogLog counters [Flajolet+07] (log log n space)
•  Uses broadword programming to combine HyperLogLog counters quickly!

Page 20

HyperLogLog counters (algorithm figure)

Page 21

Rough intuition: let x be the unknown cardinality of M. Each substream will contain approximately x/m distinct elements. Then its Max parameter should be close to log₂(x/m). The harmonic mean of the quantities 2^Max (mZ in our notation) is likely to be of the order of x/m; thus m²Z should be of the order of x. The constant α_m corrects the systematic multiplicative bias in m²Z.

Page 22

HyperLogLog counters

•  Instead of actually counting, observe a statistical feature of a set (stream) of elements

•  Feature: the number of trailing zeroes of the value of a very good hash function

•  Keep track of the maximum, Max (log log n bits!)

•  Number of distinct elements ∝ 2^Max

•  Important: the counter of stream AB is simply the maximum of the counters of A and B!

•  With 40 bits one can count up to 4 billion with a standard deviation of 6%
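A toy version of such a counter, assuming truncated SHA-256 as the "very good hash function" (an assumption: any well-mixed hash would do). A real HyperLogLog additionally splits the stream into many substreams and averages, as the next slide discusses:

```python
import hashlib

def trailing_zeros(n, width=64):
    """Number of trailing zero bits of an integer (width if n == 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1   # isolate lowest set bit

def h(item):
    """A 64-bit hash built from SHA-256 (illustrative choice)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def counter(stream):
    """Probabilistic counter: the maximum number of trailing zeros seen.
    Stores only log log n bits; # distinct elements ∝ 2**counter."""
    return max((trailing_zeros(h(x)) for x in stream), default=0)

a = list(range(1000))
b = list(range(500, 1500))
# The counter of a concatenated stream is just the max of the two parts:
assert counter(a + b) == max(counter(a), counter(b))
print(2 ** counter(a))  # a rough power-of-two estimate of |a| = 1000
```

The max-merge property in the last assertion is exactly what makes these counters combinable during diffusion: duplicates across balls cost nothing.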

Page 23

Many many counters…

•  To increase confidence, need several counters (usually 2^b, b ≥ 4) and take their harmonic mean

•  Thus each set is represented by a list of small counters

•  How small? Typically 5 bits is enough: 2^32 > 1 G. For huge graphs, 7 bits (unlikely more than 7 bits!)

•  To compute the union of two sets, these counters must be maximized one by one

•  Extracting by shifts, maximizing and putting back by shifts is unbearably slow

•  Flajolet–Martin counters (ANF) can instead just OR the features!

•  Exponential reduction in space (from log n to log log n) but unions get more complicated…

Pages 24–34

Broadword Programming (8-bit lanes)

[Figure, built up over several slides: two words, each packing the four 8-bit counters 9, 0, 3, 2 and 7, 3, 3, 6, are combined using whole-word subtractions and bit masks so that the lane-wise maxima 9, 3, 3, 6 are obtained without extracting the individual counters.]
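In the spirit of the figure above, here is one standard broadword trick for lane-wise maxima (a sketch, not necessarily the exact variant used in HyperANF): pack four 8-bit counters per word, keep values below 128 so the top bit of each lane is free, and then only whole-word operations are needed:

```python
def pack(lanes):
    """Pack a list of 8-bit counters into one integer word (lane 0 lowest)."""
    word = 0
    for v in reversed(lanes):
        word = (word << 8) | v
    return word

def unpack(word, n):
    return [(word >> (8 * i)) & 0xFF for i in range(n)]

def lanewise_max(x, y, n):
    """Per-lane max of two words of n 8-bit counters, each value < 128,
    using only whole-word operations (no per-counter extraction)."""
    H = int.from_bytes(bytes([0x80] * n), "little")  # high bit of each lane
    L = int.from_bytes(bytes([0x01] * n), "little")  # low bit of each lane
    W = (1 << (8 * n)) - 1                           # word mask
    # Setting the spare high bit of each x-lane prevents borrows from
    # crossing lanes, so one subtraction compares all lanes at once.
    d = ((x | H) - y) & W     # lane high bit set iff x_lane >= y_lane
    ge = (d >> 7) & L         # 1 in each lane where x >= y
    mask = (ge << 8) - ge     # 0xFF in winning lanes, 0x00 elsewhere
    return (x & mask) | (y & ~mask & W)

# The counters from the slides: 9, 0, 3, 2 versus 7, 3, 3, 6
x = pack([9, 0, 3, 2])
y = pack([7, 3, 3, 6])
print(unpack(lanewise_max(x, y, 4), 4))  # [9, 3, 3, 6]
```

On real 64-bit registers the same code processes eight counters per instruction; this is the "low-level parallelization" mentioned on slide 15.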

Page 35

Real speed?

•  Large size: HADI [Kang et al., 2010] is a Hadoop-conscious implementation of ANF. It takes 30 minutes on a 200K-node graph (on one of the 50 largest supercomputers in the world).

•  HyperANF does the same in a couple of minutes on a workstation (tens of minutes on a laptop).

Page 36

Experiments

•  24-core machine with:
   –  72 GB of RAM
   –  1 TB of disk space

Page 37

Experiments (time)

Ran experiments on snapshots of Facebook:
•  Jan 1, 2007
•  Jan 1, 2008
•  …
•  Jan 1, 2011
•  May 2011 (721.1 M nodes, 68.7 G edges)

Page 38

Experiments (dataset)

Considered:
•  fb: the whole Facebook graph
•  it / se: only Italian / Swedish users
•  it+se: only Italian & Swedish users
•  us: only US users

Based on users' current geo-IP location

Page 39

Facebook distance distribution (figure; average distance 4.74)

Page 40

Distance distributions (figure)

Page 41

Average Distance (figure)

Page 42

Average Degree and Density (fb) (figure)

Density defined as 2m / n(n−1)

Page 43

Diameter (fb) (figure)

Page 44

How to compute the diameter?

Lower bounds come as a byproduct of HyperANF runs.

What about the exact diameter? The computation was relatively fast: the diameter of Facebook required 10 hours of computation on a machine with 1 TB of RAM (although 256 GB would have been sufficient).
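As a concrete illustration of cheap diameter lower bounds, the classic "double sweep" heuristic (two BFS runs: one from an arbitrary node, one from the farthest node found) can be sketched as below. This is a standard technique from the literature on real-world diameters (cf. [Crescenzi+12]), shown here for intuition, not the exact method used for Facebook:

```python
from collections import deque

def bfs_farthest(adj, source):
    """BFS returning (a farthest node from source, its distance)."""
    dist = {source: 0}
    queue = deque([source])
    far, far_d = source, 0
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                if dist[y] > far_d:
                    far, far_d = y, dist[y]
                queue.append(y)
    return far, far_d

def double_sweep_lower_bound(adj, start):
    """Diameter lower bound: BFS from start, then BFS again from the
    farthest node found; any eccentricity lower-bounds the diameter."""
    u, _ = bfs_farthest(adj, start)
    _, d = bfs_farthest(adj, u)
    return d

# Path 0-1-2-3-4 plus a branch 2-5: the true diameter is 4 (from 0 to 4)
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(double_sweep_lower_bound(adj, 2))  # 4
```

On many real-world graphs this bound is tight or nearly so, which is what makes exact-diameter methods that refine such bounds practical.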

Page 45

The WebGraph Framework
(Boldi & Vigna 2003)

Page 46

"The" Web graph

•  Set U of URLs. Directed graph with:
   –  U as set of nodes
   –  an arc from x to y iff the page with URL x has a hyperlink pointing to URL y

•  Transpose graph: reverse all arcs (useful for several advanced ranking algorithms, e.g., HITS)

•  The Web graph is HUGE! (≈ 100 M nodes, 1 G links)

(figure: page A links to page B via an anchor)

Page 47

Storing the Web graph

•  What does it mean "to store (part of) the Web graph"?
   –  being able to know the successors of each node in a reasonable time (e.g., much less than 1 ms/link)
   –  having a simple way to know the node corresponding to a URL (e.g., minimal perfect hash)
   –  having a simple way to know the URL corresponding to a node (e.g., front-coded lists)

•  W.l.o.g., assume each URL is represented by an integer (0, 1, . . . , n−1, where n = |U|)

Page 48

Adjacency lists

•  The set of neighbors of a node

•  E.g., for a 4-billion-page web, we need 32 bits per node (each URL represented by an integer)

•  Naively, this demands 64 bits to represent each hyperlink

•  That's too much: a Web graph with 1 G links would require more than 64 Gbits (8 GB)

Page 49

(Some) History

•  Connectivity Server [Bharat+98] ≈ 32 bits/link
•  Algorithms for separable graphs [BBK03] ≈ 5 bits/link
•  WebBase [Cho+06] ≈ 5.6 bits/link
•  LINK database [Randall+02] ≈ 4.5 bits/link
•  WebGraph [Boldi Vigna 04] ≈ 3 bits/link

http://webgraph.dsi.unimi.it/

Page 50

Exploit features of Web graphs

1.  Locality: usually most links in a page are navigational in nature and so are local (i.e., they point to other pages on the same host). The source and target of those links share a long common prefix, so if URLs are sorted lexicographically, the indices of source and target are close to each other. E.g.:
   –  www.stanford.edu/alchemy
   –  www.stanford.edu/biology
   –  www.stanford.edu/biology/plant
   –  www.stanford.edu/biology/plant/copyright
   –  www.stanford.edu/biology/plant/people
   –  www.stanford.edu/chemistry

The literature reports that on average 80% of the links are local.

Page 51

Locality

Property: with URLs lexicographically ordered, for many arcs x→y the difference |x − y| is often small.

Improvement: represent the successors y1 < y2 < ··· < yk of x using their gaps instead:

y1 − x,  y2 − y1 − 1,  ...,  yk − yk−1 − 1

All integers are non-negative, except possibly the first one. To avoid this, a special coding is used for the first integer.
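The gap transformation above, together with one possible coding of the possibly negative first gap, can be sketched as follows. The mapping (2v for v ≥ 0, 2|v| + 1 for v < 0) follows the convention of the worked example later in these slides and is illustrative:

```python
def to_nat(v):
    """Map an integer to a natural number (2v if v >= 0, 2|v| + 1 if v < 0),
    so the possibly-negative first gap can use the same integer codes."""
    return 2 * v if v >= 0 else 2 * (-v) + 1

def gap_encode(x, successors):
    """Encode the sorted successor list of node x as small gaps."""
    gaps = [to_nat(successors[0] - x)]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)   # strictly increasing, so >= 0
    return gaps

def gap_decode(x, gaps):
    u = gaps[0]
    first = x + (u // 2) if u % 2 == 0 else x - ((u - 1) // 2)
    out = [first]
    for g in gaps[1:]:
        out.append(out[-1] + g + 1)
    return out

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]
enc = gap_encode(15, succ)
print(enc)                        # [5, 1, 0, 0, 0, 0, 3, 0, 178]
assert gap_decode(15, enc) == succ
```

Note how a list with mostly local successors becomes a list of mostly tiny numbers, which compress well with universal codes.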

Page 52

Naïve representation and representation with gaps (figure)

Page 53

Exploit features of Web graphs

2.  Similarity: pages that are close to each other (in lexicographic order) tend to have many common successors. This is because many navigational links are the same within the same local cluster of pages (even non-navigational links are often copied from one page to another on the same host).

Page 54

Similarity

•  Property: URLs that are close in lexicographic order are likely to belong to the same site (probably to the same level of the site hierarchy)

•  Consequence: they are likely to have similar successor lists

•  Improvement: code each successor list by referentiation:
   –  a reference r which tells us to start from the list of x − r (if r > 0)
   –  a bit string (copy list) which tells us which successors must be copied
   –  a list of extra nodes for the remaining successors

The choice of r is critical: r is chosen as the value between 0 and W (a fixed parameter, the window size) that gives the best compression. A larger W yields a better compression ratio but slower and more memory-consuming compression and decompression.

Page 55

Reference compression (figure)

Page 56

Differential compression

Look at the copy list as a sequence of 1-blocks and 0-blocks:
•  the length of each block is decremented by 1, except for the first block
•  the first copy block is always assumed to be a 1-block (so the first copy block is 0 if the copy list starts with 0)
•  the last block is always omitted: its value can be deduced from the number of blocks and the outdegree

Page 57

Differential compression

Copy blocks specify, by inclusion/exclusion, sublists that must be alternately copied or discarded.

Using typical codes, such as γ coding, copying an entire list costs 1 bit: we can code a link in less than 1 bit!
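A sketch of the copy-block encoding described above (names illustrative). The encoder turns the copy list into run lengths, decrements all but the first, and omits the last; the decoder recovers the omitted run from the parity of the number of stored blocks:

```python
def to_copy_blocks(ref, cur):
    """Express successor list cur w.r.t. reference list ref as
    (copy blocks, extra nodes)."""
    cur_set, ref_set = set(cur), set(ref)
    copy_list = [1 if v in cur_set else 0 for v in ref]
    extra = [v for v in cur if v not in ref_set]
    # run-length encode the copy list; the first run counts 1s
    runs, bit, i = [], 1, 0
    while i < len(copy_list):
        n = 0
        while i < len(copy_list) and copy_list[i] == bit:
            n, i = n + 1, i + 1
        runs.append(n)
        bit ^= 1
    if not runs:
        return [], extra
    blocks = [runs[0]] + [n - 1 for n in runs[1:]]
    return blocks[:-1], extra      # last block deducible, hence omitted

def from_copy_blocks(ref, blocks, extra):
    """Invert to_copy_blocks."""
    copied, i = [], 0
    for j, b in enumerate(blocks):
        n = b if j == 0 else b + 1     # only the first length is undecremented
        if j % 2 == 0:                 # even-indexed runs are 1-runs (copy)
            copied += ref[i:i + n]
        i += n
    if len(blocks) % 2 == 0:           # omitted last run is a 1-run
        copied += ref[i:]
    return sorted(copied + extra)

ref = [3, 5, 7, 9]
cur = [3, 7, 8, 9, 10]
blocks, extra = to_copy_blocks(ref, cur)
print(blocks, extra)  # [1, 0] [8, 10]
assert from_copy_blocks(ref, blocks, extra) == cur
```

When cur copies ref entirely, the stored blocks collapse to almost nothing, which is the "less than 1 bit per link" effect mentioned above.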

Page 58

Exploit features of Web graphs

3.  Consecutivity: many links within the same page are likely to be consecutive (w.r.t. lexicographic order). This is mainly due to two distinct phenomena:

1.  Most pages contain sets of navigational links which point to a fixed level of the hierarchy. Because of the hierarchical nature of URLs, links in pages at the bottom level of the hierarchy tend to be lexicographically adjacent.

2.  In the transposed Web graph: pages that are high in the site hierarchy (e.g., the home page) are pointed to by most pages of the site.

Page 59

Consecutivity

•  More in general, consecutivity is the dual of distance-one similarity: if a graph is easily compressible using similarity at distance one, its transpose must sport large intervals of consecutive links, and vice versa.

Page 60

Intervalization

•  Exploit consecutivity among nodes. Instead of compressing them directly using gaps, first isolate the subsequences corresponding to integer intervals (only intervals of length at least a threshold Lmin are considered). The list of extra nodes is then compressed as follows:
   –  a list of integer intervals: each interval is represented by its left extreme and its length; left extremes are compressed through the difference with the previous right extreme minus 2, and interval lengths are decremented by Lmin
   –  a list of residuals (the remaining integers), compressed through differences

Page 61

Example with Lmin = 2:

Interval [15,19]: left extreme 15 − 15 = 0, length 5 − 2 = 3

Interval [23,24]: left extreme 23 − 19 − 2 = 2, length 2 − 2 = 0

Residuals are 13, 203, 315, 1034, which gives 13 − 15 = −2 (encoded as 2·|−2| + 1 = 5), 203 − 13 − 1 = 189, 315 − 203 − 1 = 111, 1034 − 315 − 1 = 718.
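The worked example above can be reproduced with a short sketch (the encoding of the negative first residual as 2|v| + 1 follows the slide; function names are illustrative):

```python
def to_nat(v):
    """2v if v >= 0 else 2|v| + 1: makes the first difference non-negative."""
    return 2 * v if v >= 0 else 2 * (-v) + 1

def intervalize(x, nodes, l_min):
    """Split a sorted successor list of node x into maximal integer
    intervals of length >= l_min plus residuals, then encode both
    with small non-negative numbers."""
    intervals, residuals, i = [], [], 0
    while i < len(nodes):
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1                      # extend the consecutive run
        if j - i + 1 >= l_min:
            intervals.append((nodes[i], j - i + 1))
        else:
            residuals.extend(nodes[i:j + 1])
        i = j + 1
    # intervals: left extreme vs previous right extreme + 2 (first vs x),
    # lengths decremented by l_min
    enc_iv, prev_right = [], None
    for left, length in intervals:
        if prev_right is None:
            enc_iv.append((to_nat(left - x), length - l_min))
        else:
            enc_iv.append((left - prev_right - 2, length - l_min))
        prev_right = left + length - 1
    # residuals: differences, first one vs x
    enc_res, prev = [], None
    for r in residuals:
        enc_res.append(to_nat(r - x) if prev is None else r - prev - 1)
        prev = r
    return enc_iv, enc_res

nodes = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
iv, res = intervalize(15, nodes, l_min=2)
print(iv)   # [(0, 3), (2, 0)]
print(res)  # [5, 189, 111, 718]
```

The output matches the numbers on the slide: intervals [15,19] and [23,24] become (0, 3) and (2, 0), and the residuals 13, 203, 315, 1034 become 5, 189, 111, 718.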

Page 62

Intervalization (figure)

Page 63

Choices in the reference scheme

•  How do you choose the reference node for x? You consider the successor lists of the last W nodes, but… you do not consider lists which would cause a recursive reference chain longer than R (the maximum reference count). With no limit on R, accessing the adjacency list of x may require decompressing all lists up to x!

•  Tuning the parameter R is essential for deciding the compression/speed trade-off: a small R gives worse compression but shorter (random) access times.

•  W essentially affects compression time only.

Page 64

Implementation

•  Random access to successor lists is implemented lazily through a cascade of iterators.

•  Each series of intervals and each reference causes the creation of an iterator; the same happens for residuals.

•  The results of all iterators are then merged.

•  The advantage of laziness is that we never have to build an actual list of successors in memory, so the overhead is limited to the number of actual reads, not to the number of successor lists that would be necessary to re-create a given one.

Page 65

Access speed

•  Access speed to a compressed graph is commonly measured as the time required to access a link (≈ 300 ns for WebGraph).

•  This quantity, however, is strongly dependent on the architecture (e.g., cache size) and, even more, on low-level optimizations (e.g., hard-coding of the first codewords of an instantaneous code).

•  To compare speeds reliably, we need public data that anyone can access, and a common framework for the low-level operations.

•  A first step is http://webgraph-data.dsi.unimi.it/: freely available data to compare compression techniques.

Page 66

Summary

WebGraph provides methods to manage very large (web) graphs. It consists of:

•  ζ codes: particularly suitable for storing web graphs (or integers with a power-law distribution in a certain exponential range)

•  algorithms for compressing web graphs that exploit referentiation, intervalization and ζ codes to provide a high compression ratio

•  algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay decompression until it is actually necessary

Page 67

Conclusions

WebGraph combines new codes, new insights on the structure of the Web graph and new algorithmic techniques to achieve a very high compression ratio, while still retaining good access speed (can it be better?).

The software is highly tunable: one can experiment with dozens of codes, algorithmic techniques and compression parameters, and there is a large unexplored space of combinations.

A theoretically interesting question is how to optimally combine differential compression and intervalization: can you do better than the current greedy approach (first copy as much as you can, then intervalize)?

Page 68

Conclusions

The compression techniques are specialized for Web graphs.

The average link size decreases as the graph grows.

The average link access time increases as the graph grows.

ζ codes seem to achieve a great trade-off between average bit size and access time.

Page 69

References 1/2

1.  [AJB00] Albert, Jeong, and Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000.
2.  [Backstrom+12] Backstrom, Boldi, Rosa, Ugander, and Vigna. 2012. Four degrees of separation. In Proceedings of the 3rd Annual ACM Web Science Conference (WebSci '12)
3.  [BBK03] Blandford, Blelloch, and Kash. 2003. Compact representations of separable graphs. In Proceedings of the 14th ACM–SIAM Symposium on Discrete Algorithms (SODA '03)
4.  [Bharat+98] Bharat, Broder, Henzinger, Kumar, and Venkatasubramanian. 1998. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of the Seventh International Conference on World Wide Web (WWW7)
5.  [Bra01] Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
6.  [BRV11a] Boldi, Rosa, and Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW '11)
7.  [BRV11b] Boldi, Rosa, and Vigna. 2011. Robustness of social networks: comparative results based on distance distributions. In Proceedings of the Third International Conference on Social Informatics (SocInfo '11)
8.  [BV04] Boldi and Vigna. 2004. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW '04)
9.  [Bor05] Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005.
10.  [Cho+06] Cho, Garcia-Molina, Haveliwala, Lam, Paepcke, Raghavan, and Wesley. 2006. Stanford WebBase components and applications. ACM Trans. Internet Technol. 6

Page 70

References 2/2

11.  [Crescenzi+12] Crescenzi, Grossi, Habib, Lanzi, and Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 2012.
12.  [Flajolet+07] Flajolet, Fusy, Gandouet, and Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AofA '07)
13.  [Firmani+12] Firmani, Italiano, Laura, Orlandi, and Santaroni. 2012. Computing strong articulation points and strong bridges in large scale graphs. In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA '12)
14.  [Italiano+12] Italiano, Laura, and Santaroni. Finding strong articulation points and strong bridges in linear time. Theoretical Computer Science, vol. 447, 74–84, 2012.
15.  [KLPM10] Kwak, Lee, Park, and Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
16.  [LH08] Leskovec and Horvitz. 2008. Planetary-scale views on a large instant-messaging network. In Proceedings of the 17th International Conference on World Wide Web (WWW '08)
17.  [Mislove+07] Mislove, Marcon, Gummadi, Druschel, and Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC '07)
18.  [PGF02] Palmer, Gibbons, and Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02)
19.  [Randall+02] Randall, Stata, Wiener, and Wickremesinghe. 2002. The Link Database: fast access to graphs of the Web. In Proceedings of the Data Compression Conference (DCC '02)
20.  [RAK07] Raghavan, Albert, and Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 76(3), 2007.