Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)



DESCRIPTION

The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. The appearance of such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

TRANSCRIPT

Page 1: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)

Original Plan

1.  Algorithms for BIG graphs

•  The centrality of centrality

•  How to store BIG Graphs (WebGraph Framework)

•  Four Degrees of Separation

•  Diameter and Radius

2.  Big Data and Memory Errors

Page 2

Slightly Revised Plan

1.  Algorithms for BIG graphs

•  The centrality of centrality

•  Four Degrees of Separation

•  Diameter (no Radius)

•  How to store BIG Graphs (WebGraph Framework)

2.  Big Data and Memory Errors

Page 3

Four Degrees of Separation

Page 4

Literature

•  Frigyes Karinthy, in his 1929 short story "Láncszemek" ("Chains"), suggested that any two persons are separated by at most six friendship links

•  Just an (optimistic) positivistic statement about combinatorial explosion

•  Used by John Guare in his 1990 eponymous play (and the 1993 movie by Fred Schepisi)

Page 5

The Sociologists

•  M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s)

•  A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav. Sci. 1961)

•  S. Milgram: An experimental study of the small world problem. (Sociometry, 1969)…

Page 6

Milgram's question

•  "Given two individuals selected randomly from the population, what is the probability that the minimum number of intermediaries required to link them is 0, 1, 2, . . . , k?"

Page 7

Milgram's question

•  What is the distance distribution of the acquaintance graph?
   –  how many pairs are friends?
   –  how many are not friends but have a friend in common?
   –  …

•  Note on "distance":
   –  sociologists measure the degrees of separation
   –  as computer scientists we measure the graph-theoretic distance (just add one)

Page 8

Milgram's experiment

296 people (starting population) asked to dispatch a parcel to a single individual (target)

•  Target: a Boston stockbroker

•  Starting population selected as follows:
   –  100 were random Boston inhabitants (group A)
   –  100 were random Nebraska stockholders (group B)
   –  96 were random Nebraska inhabitants (group C)

•  Rule of the game: parcels could be directly sent only to someone the sender knows personally ("first-name acquaintance")

Page 9

Milgram's experiment (figure)

Page 10

Milgram's experiment

•  Actually completed 64 chains (22%)

•  453 intermediaries happened to be involved in the experiments

•  Average length of the completed chains was 6.2, ranging from 5.4 (Boston group) to 6.7 (random Nebraska inhabitants)

•  A distance of 6.7 corresponds to 5.7 degrees of separation (thus six degrees of separation)

Page 11

A new Milgram experiment?

•  Can we reproduce Milgram's experiment on a large scale?

•  How can one compute or approximate the distance distribution of a given huge friendship graph (such as Facebook)?

Page 12

Graph distances and distribution

•  The distance d(x,y) is the length of the shortest path from x to y
   –  d(x,y) = ∞ if one cannot go from x to y

•  For undirected graphs d(x,y) = d(y,x)

•  For every t, count the number of pairs (x,y) such that d(x,y) = t (distance distribution)

•  The fraction of pairs at distance t is (the density function of) a distribution
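The distribution just defined can be computed exactly on a small graph with one breadth-first visit per node. A minimal sketch (the adjacency-dict representation and function name are illustrative; this is exactly the O(mn) baseline that the rest of the lecture tries to avoid):

```python
from collections import deque, Counter

def distance_distribution(adj):
    """For every t, count the ordered pairs (x, y) with d(x, y) = t,
    by running one BFS from every node."""
    dist_counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for node, d in dist.items():
            if node != source:
                dist_counts[d] += 1
    return dist_counts

# A path on four nodes: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(distance_distribution(adj))  # Counter({1: 6, 2: 4, 3: 2})
```

Unreachable pairs (distance ∞) simply never appear in the counter.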

Page 13

Previous experiments

On-line social networks:

•  6.6 degrees of separation on a one-month MSN Messenger communication graph (Leskovec and Horvitz [LH08])
   –  180 M nodes and 1.3 G arcs

•  3.67 degrees of separation in Twitter [KLPM10]
   –  5 G follows
   –  Is this meaningful? In Twitter, links are created without permission at both ends…

Page 14

•  4.74 degrees of separation in Facebook (Backstrom et al. [Backstrom+12])
   –  721 M people and 69 G friendship links

Page 15

HyperANF

•  New tool for studying/computing the distance distribution of very large graphs

•  Diffusion-based approximation algorithm

•  Based on WebGraph (Boldi et al. [BV04]) and ANF [Palmer et al., 2002]

•  Uses HyperLogLog counters [Flajolet et al., 2007] and broadword programming for low-level parallelization

Page 16

Computing the Neighborhood Function

•  Neighborhood function N(t): for each t, the number of pairs at distance ≤ t
   –  provides data about how fast the "average ball" around each node expands

•  Many breadth-first visits: O(mn), needs direct access ☹

•  Sampling: a fraction of breadth-first visits; very unreliable results on graphs that are not strongly connected, needs direct access ☹

•  Edith Cohen's [JCSS 1997] size estimation framework: very powerful (strong theoretical guarantees) but does not scale really well, needs direct access ☹

Page 17

Alternative: Diffusion

•  Basic idea: Palmer et al. [PGF02]

•  Let B_t(x) be the ball of radius t around x (nodes at distance at most t from x)

•  Clearly B_0(x) = {x}

•  But also B_{t+1}(x) = ∪_{x→y} B_t(y) ∪ {x}

•  So we can compute balls by enumerating the arcs x→y and performing unions (on sets)

•  The neighborhood function at t is given by the sum of the sizes of the balls of radius t, i.e., |B_t(x_1)| + |B_t(x_2)| + … + |B_t(x_n)|
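Before any approximation enters the picture, the diffusion scheme above can be sketched with exact sets (graph representation and names are illustrative):

```python
def neighborhood_function(nodes, arcs, max_t):
    """Exact diffusion: B_0(x) = {x}; B_{t+1}(x) = {x} ∪ (union of B_t(y)
    over arcs x→y).  N(t) is the sum of ball sizes, i.e. the number of
    pairs at distance ≤ t (including the trivial pairs (x, x))."""
    balls = {x: {x} for x in nodes}
    result = [sum(len(b) for b in balls.values())]
    for _ in range(max_t):
        # synchronous round: new balls are built from the previous ones
        new_balls = {x: set(balls[x]) for x in nodes}
        for x, y in arcs:          # enumerate arcs, perform unions
            new_balls[x] |= balls[y]
        balls = new_balls
        result.append(sum(len(b) for b in balls.values()))
    return result

# Directed triangle 0→1→2→0
print(neighborhood_function([0, 1, 2], [(0, 1), (1, 2), (2, 0)], 2))
# [3, 6, 9]
```

The next slide explains why keeping the sets explicitly is too costly, and what to replace them with.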

Page 18

Easy but Costly: Approximate

•  Every set needs O(n) bits: overall O(n²). Too many.

•  All we need is cardinality estimators: estimate the number of distinct elements (nodes) in very large multisets (balls around nodes), presented as massive streams

•  Do not estimate cardinality exactly: use probabilistic counting to get approximate estimates

•  Wish to choose approximate sets such that unions can be computed quickly

Page 19

ANF and HyperANF

ANF [PGF02]:
•  Diffusion with Flajolet–Martin (MF) counters (log n + c space)
•  MF counters can be combined with OR

HyperANF [BRV11a]:
•  Diffusion with HyperLogLog counters [Flajolet+07] (log log n space)
•  Uses broadword programming to combine HyperLogLog counters quickly!

Page 20

HyperLogLog counters (algorithm figure)

Page 21

Rough intuition: let x be the unknown cardinality of M. Each substream will contain approximately x/m distinct elements. Then its Max parameter should be close to log₂(x/m). The harmonic mean of the quantities 2^Max (mZ in our notation) is likely to be of the order of x/m; thus m²Z should be of the order of x. The constant α_m corrects the systematic multiplicative bias in m²Z.

Page 22

HyperLogLog counters

•  Instead of actually counting, observe a statistical feature of a set (stream) of elements

•  Feature: the number of trailing zeroes of the value of a very good hash function

•  Keep track of the maximum, Max (log log n bits!)

•  Number of distinct elements ∝ 2^Max

•  Important: the counter of stream AB is simply the maximum of the counters of A and B!

•  With 40 bits one can count up to 4 billion with a standard deviation of 6%
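A toy version of such a counter, assuming truncated SHA-256 as the "very good hash function" (an assumption: any well-mixed hash would do). A real HyperLogLog additionally splits the stream into many substreams and averages, as the next slide discusses:

```python
import hashlib

def trailing_zeros(n, width=64):
    """Number of trailing zero bits of an integer (width if n == 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1   # isolate lowest set bit

def h(item):
    """A 64-bit hash built from SHA-256 (illustrative choice)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def counter(stream):
    """Probabilistic counter: the maximum number of trailing zeros seen.
    Stores only log log n bits; # distinct elements ∝ 2**counter."""
    return max((trailing_zeros(h(x)) for x in stream), default=0)

a = list(range(1000))
b = list(range(500, 1500))
# The counter of a concatenated stream is just the max of the two parts:
assert counter(a + b) == max(counter(a), counter(b))
print(2 ** counter(a))  # a rough power-of-two estimate of |a| = 1000
```

The max-merge property in the last assertion is exactly what makes these counters combinable during diffusion: duplicates across balls cost nothing.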

Page 23

Many many counters…

•  To increase confidence, need several counters (usually 2^b, b ≥ 4) and take their harmonic mean

•  Thus each set is represented by a list of small counters

•  How small? Typically 5 bits is enough: 2^32 > 1 G. For huge graphs, 7 bits (unlikely more than 7 bits!)

•  To compute the union of two sets, these counters must be maximized one by one

•  Extracting by shifts, maximizing and putting back by shifts is unbearably slow

•  Flajolet–Martin counters (ANF) can instead just OR the features!

•  Exponential reduction in space (from log n to log log n) but unions get more complicated…

Pages 24–34

Broadword Programming (8-bit lanes)

[Figure, built up over several slides: two words, each packing the four 8-bit counters 9, 0, 3, 2 and 7, 3, 3, 6, are combined using whole-word subtractions and bit masks so that the lane-wise maxima 9, 3, 3, 6 are obtained without extracting the individual counters.]
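In the spirit of the figure above, here is one standard broadword trick for lane-wise maxima (a sketch, not necessarily the exact variant used in HyperANF): pack four 8-bit counters per word, keep values below 128 so the top bit of each lane is free, and then only whole-word operations are needed:

```python
def pack(lanes):
    """Pack a list of 8-bit counters into one integer word (lane 0 lowest)."""
    word = 0
    for v in reversed(lanes):
        word = (word << 8) | v
    return word

def unpack(word, n):
    return [(word >> (8 * i)) & 0xFF for i in range(n)]

def lanewise_max(x, y, n):
    """Per-lane max of two words of n 8-bit counters, each value < 128,
    using only whole-word operations (no per-counter extraction)."""
    H = int.from_bytes(bytes([0x80] * n), "little")  # high bit of each lane
    L = int.from_bytes(bytes([0x01] * n), "little")  # low bit of each lane
    W = (1 << (8 * n)) - 1                           # word mask
    # Setting the spare high bit of each x-lane prevents borrows from
    # crossing lanes, so one subtraction compares all lanes at once.
    d = ((x | H) - y) & W     # lane high bit set iff x_lane >= y_lane
    ge = (d >> 7) & L         # 1 in each lane where x >= y
    mask = (ge << 8) - ge     # 0xFF in winning lanes, 0x00 elsewhere
    return (x & mask) | (y & ~mask & W)

# The counters from the slides: 9, 0, 3, 2 versus 7, 3, 3, 6
x = pack([9, 0, 3, 2])
y = pack([7, 3, 3, 6])
print(unpack(lanewise_max(x, y, 4), 4))  # [9, 3, 3, 6]
```

On real 64-bit registers the same code processes eight counters per instruction; this is the "low-level parallelization" mentioned on slide 15.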

Page 35

Real speed?

•  Large size: HADI [Kang et al., 2010] is a Hadoop-conscious implementation of ANF. It takes 30 minutes on a 200K-node graph (on one of the 50 largest supercomputers in the world).

•  HyperANF does the same in a couple of minutes on a workstation (tens of minutes on a laptop).

Page 36

Experiments

•  24-core machine with:
   –  72 GB of RAM
   –  1 TB of disk space

Page 37

Experiments (time)

Ran experiments on snapshots of Facebook:
•  Jan 1, 2007
•  Jan 1, 2008
•  …
•  Jan 1, 2011
•  May 2011 (721.1 M nodes, 68.7 G edges)

Page 38

Experiments (dataset)

Considered:
•  fb: the whole Facebook graph
•  it / se: only Italian / Swedish users
•  it+se: only Italian & Swedish users
•  us: only US users

Based on users' current geo-IP location

Page 39

Facebook distance distribution (figure; average distance 4.74)

Page 40

Distance distributions (figure)

Page 41

Average Distance (figure)

Page 42

Average Degree and Density (fb) (figure)

Density defined as 2m / n(n−1)

Page 43

Diameter (fb) (figure)

Page 44

How to compute the diameter?

Lower bounds come as a byproduct of HyperANF runs.

What about the exact diameter? The computation was relatively fast: the diameter of Facebook required 10 hours of computation on a machine with 1 TB of RAM (although 256 GB would have been sufficient).
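As a concrete illustration of cheap diameter lower bounds, the classic "double sweep" heuristic (two BFS runs: one from an arbitrary node, one from the farthest node found) can be sketched as below. This is a standard technique from the literature on real-world diameters (cf. [Crescenzi+12]), shown here for intuition, not the exact method used for Facebook:

```python
from collections import deque

def bfs_farthest(adj, source):
    """BFS returning (a farthest node from source, its distance)."""
    dist = {source: 0}
    queue = deque([source])
    far, far_d = source, 0
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                if dist[y] > far_d:
                    far, far_d = y, dist[y]
                queue.append(y)
    return far, far_d

def double_sweep_lower_bound(adj, start):
    """Diameter lower bound: BFS from start, then BFS again from the
    farthest node found; any eccentricity lower-bounds the diameter."""
    u, _ = bfs_farthest(adj, start)
    _, d = bfs_farthest(adj, u)
    return d

# Path 0-1-2-3-4 plus a branch 2-5: the true diameter is 4 (from 0 to 4)
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(double_sweep_lower_bound(adj, 2))  # 4
```

On many real-world graphs this bound is tight or nearly so, which is what makes exact-diameter methods that refine such bounds practical.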

Page 45

The WebGraph Framework
(Boldi & Vigna 2003)

Page 46

"The" Web graph

•  Set U of URLs. Directed graph with:
   –  U as set of nodes
   –  an arc from x to y iff the page with URL x has a hyperlink pointing to URL y

•  Transpose graph: reverse all arcs (useful for several advanced ranking algorithms, e.g., HITS)

•  The Web graph is HUGE! (≈ 100 M nodes, 1 G links)

(figure: page A links to page B via an anchor)

Page 47

Storing the Web graph

•  What does it mean "to store (part of) the Web graph"?
   –  being able to know the successors of each node in a reasonable time (e.g., much less than 1 ms/link)
   –  having a simple way to know the node corresponding to a URL (e.g., minimal perfect hash)
   –  having a simple way to know the URL corresponding to a node (e.g., front-coded lists)

•  W.l.o.g., assume each URL is represented by an integer (0, 1, . . . , n−1, where n = |U|)

Page 48

Adjacency lists

•  The set of neighbors of a node

•  E.g., for a 4-billion-page web, we need 32 bits per node (each URL represented by an integer)

•  Naively, this demands 64 bits to represent each hyperlink

•  That's too much: a Web graph with 1 G links would require more than 64 Gbits (8 GB)

Page 49

(Some) History

•  Connectivity Server [Bharat+98] ≈ 32 bits/link
•  Algorithms for separable graphs [BBK03] ≈ 5 bits/link
•  WebBase [Cho+06] ≈ 5.6 bits/link
•  LINK database [Randall+02] ≈ 4.5 bits/link
•  WebGraph [Boldi Vigna 04] ≈ 3 bits/link

http://webgraph.dsi.unimi.it/

Page 50

Exploit features of Web graphs

1.  Locality: usually most links in a page are navigational in nature and so are local (i.e., they point to other pages on the same host). The source and target of those links share a long common prefix, so if URLs are sorted lexicographically, the indices of source and target are close to each other. E.g.:
   –  www.stanford.edu/alchemy
   –  www.stanford.edu/biology
   –  www.stanford.edu/biology/plant
   –  www.stanford.edu/biology/plant/copyright
   –  www.stanford.edu/biology/plant/people
   –  www.stanford.edu/chemistry

The literature reports that on average 80% of the links are local.

Page 51

Locality

Property: with URLs lexicographically ordered, for many arcs x→y the difference |x − y| is often small.

Improvement: represent the successors y1 < y2 < ··· < yk of x using their gaps instead:

y1 − x,  y2 − y1 − 1,  ...,  yk − yk−1 − 1

All integers are non-negative, except possibly the first one. To avoid this, a special coding is used for the first integer.
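The gap transformation above, together with one possible coding of the possibly negative first gap, can be sketched as follows. The mapping (2v for v ≥ 0, 2|v| + 1 for v < 0) follows the convention of the worked example later in these slides and is illustrative:

```python
def to_nat(v):
    """Map an integer to a natural number (2v if v >= 0, 2|v| + 1 if v < 0),
    so the possibly-negative first gap can use the same integer codes."""
    return 2 * v if v >= 0 else 2 * (-v) + 1

def gap_encode(x, successors):
    """Encode the sorted successor list of node x as small gaps."""
    gaps = [to_nat(successors[0] - x)]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)   # strictly increasing, so >= 0
    return gaps

def gap_decode(x, gaps):
    u = gaps[0]
    first = x + (u // 2) if u % 2 == 0 else x - ((u - 1) // 2)
    out = [first]
    for g in gaps[1:]:
        out.append(out[-1] + g + 1)
    return out

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]
enc = gap_encode(15, succ)
print(enc)                        # [5, 1, 0, 0, 0, 0, 3, 0, 178]
assert gap_decode(15, enc) == succ
```

Note how a list with mostly local successors becomes a list of mostly tiny numbers, which compress well with universal codes.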

Page 52

Naïve representation and representation with gaps (figure)

Page 53

Exploit features of Web graphs

2.  Similarity: pages that are close to each other (in lexicographic order) tend to have many common successors. This is because many navigational links are the same within the same local cluster of pages (even non-navigational links are often copied from one page to another on the same host).

Page 54

Similarity

•  Property: URLs that are close in lexicographic order are likely to belong to the same site (probably to the same level of the site hierarchy)

•  Consequence: they are likely to have similar successor lists

•  Improvement: code each successor list by referentiation:
   –  a reference r which tells us to start from the list of x − r (if r > 0)
   –  a bit string (copy list) which tells us which successors must be copied
   –  a list of extra nodes for the remaining successors

The choice of r is critical: r is chosen as the value between 0 and W (a fixed parameter, the window size) that gives the best compression. A larger W yields a better compression ratio but slower and more memory-consuming compression and decompression.

Page 55

Reference compression (figure)

Page 56

Differential compression

Look at the copy list as a sequence of 1-blocks and 0-blocks:
•  the length of each block is decremented by 1, except for the first block
•  the first copy block is always assumed to be a 1-block (so the first copy block is 0 if the copy list starts with 0)
•  the last block is always omitted: its value can be deduced from the number of blocks and the outdegree

Page 57

Differential compression

Copy blocks specify, by inclusion/exclusion, sublists that must be alternately copied or discarded.

Using typical codes, such as γ coding, copying an entire list costs 1 bit: we can code a link in less than 1 bit!
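A sketch of the copy-block encoding described above (names illustrative). The encoder turns the copy list into run lengths, decrements all but the first, and omits the last; the decoder recovers the omitted run from the parity of the number of stored blocks:

```python
def to_copy_blocks(ref, cur):
    """Express successor list cur w.r.t. reference list ref as
    (copy blocks, extra nodes)."""
    cur_set, ref_set = set(cur), set(ref)
    copy_list = [1 if v in cur_set else 0 for v in ref]
    extra = [v for v in cur if v not in ref_set]
    # run-length encode the copy list; the first run counts 1s
    runs, bit, i = [], 1, 0
    while i < len(copy_list):
        n = 0
        while i < len(copy_list) and copy_list[i] == bit:
            n, i = n + 1, i + 1
        runs.append(n)
        bit ^= 1
    if not runs:
        return [], extra
    blocks = [runs[0]] + [n - 1 for n in runs[1:]]
    return blocks[:-1], extra      # last block deducible, hence omitted

def from_copy_blocks(ref, blocks, extra):
    """Invert to_copy_blocks."""
    copied, i = [], 0
    for j, b in enumerate(blocks):
        n = b if j == 0 else b + 1     # only the first length is undecremented
        if j % 2 == 0:                 # even-indexed runs are 1-runs (copy)
            copied += ref[i:i + n]
        i += n
    if len(blocks) % 2 == 0:           # omitted last run is a 1-run
        copied += ref[i:]
    return sorted(copied + extra)

ref = [3, 5, 7, 9]
cur = [3, 7, 8, 9, 10]
blocks, extra = to_copy_blocks(ref, cur)
print(blocks, extra)  # [1, 0] [8, 10]
assert from_copy_blocks(ref, blocks, extra) == cur
```

When cur copies ref entirely, the stored blocks collapse to almost nothing, which is the "less than 1 bit per link" effect mentioned above.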

Page 58

Exploit features of Web graphs

3.  Consecutivity: many links within the same page are likely to be consecutive (w.r.t. lexicographic order). This is mainly due to two distinct phenomena:

1.  Most pages contain sets of navigational links which point to a fixed level of the hierarchy. Because of the hierarchical nature of URLs, links in pages at the bottom level of the hierarchy tend to be lexicographically adjacent.

2.  In the transposed Web graph: pages that are high in the site hierarchy (e.g., the home page) are pointed to by most pages of the site.

Page 59

Consecutivity

•  More in general, consecutivity is the dual of distance-one similarity: if a graph is easily compressible using similarity at distance one, its transpose must sport large intervals of consecutive links, and vice versa.

Page 60

Intervalization

•  Exploit consecutivity among nodes. Instead of compressing them directly using gaps, first isolate the subsequences corresponding to integer intervals (only intervals of length at least a threshold Lmin are considered). The list of extra nodes is then compressed as follows:
   –  a list of integer intervals: each interval is represented by its left extreme and its length; left extremes are compressed through the difference with the previous right extreme minus 2, and interval lengths are decremented by Lmin
   –  a list of residuals (the remaining integers), compressed through differences

Page 61

Example with Lmin = 2:

Interval [15,19]: left extreme 15 − 15 = 0, length 5 − 2 = 3

Interval [23,24]: left extreme 23 − 19 − 2 = 2, length 2 − 2 = 0

Residuals are 13, 203, 315, 1034, which gives 13 − 15 = −2 (encoded as 2·|−2| + 1 = 5), 203 − 13 − 1 = 189, 315 − 203 − 1 = 111, 1034 − 315 − 1 = 718.
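The worked example above can be reproduced with a short sketch (the encoding of the negative first residual as 2|v| + 1 follows the slide; function names are illustrative):

```python
def to_nat(v):
    """2v if v >= 0 else 2|v| + 1: makes the first difference non-negative."""
    return 2 * v if v >= 0 else 2 * (-v) + 1

def intervalize(x, nodes, l_min):
    """Split a sorted successor list of node x into maximal integer
    intervals of length >= l_min plus residuals, then encode both
    with small non-negative numbers."""
    intervals, residuals, i = [], [], 0
    while i < len(nodes):
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1                      # extend the consecutive run
        if j - i + 1 >= l_min:
            intervals.append((nodes[i], j - i + 1))
        else:
            residuals.extend(nodes[i:j + 1])
        i = j + 1
    # intervals: left extreme vs previous right extreme + 2 (first vs x),
    # lengths decremented by l_min
    enc_iv, prev_right = [], None
    for left, length in intervals:
        if prev_right is None:
            enc_iv.append((to_nat(left - x), length - l_min))
        else:
            enc_iv.append((left - prev_right - 2, length - l_min))
        prev_right = left + length - 1
    # residuals: differences, first one vs x
    enc_res, prev = [], None
    for r in residuals:
        enc_res.append(to_nat(r - x) if prev is None else r - prev - 1)
        prev = r
    return enc_iv, enc_res

nodes = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
iv, res = intervalize(15, nodes, l_min=2)
print(iv)   # [(0, 3), (2, 0)]
print(res)  # [5, 189, 111, 718]
```

The output matches the numbers on the slide: intervals [15,19] and [23,24] become (0, 3) and (2, 0), and the residuals 13, 203, 315, 1034 become 5, 189, 111, 718.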

Page 62

Intervalization (figure)

Page 63

Choices in the reference scheme

•  How do you choose the reference node for x? You consider the successor lists of the last W nodes, but… you do not consider lists which would cause a recursive reference chain longer than R (the maximum reference count). With no limit on R, accessing the adjacency list of x may require decompressing all lists up to x!

•  Tuning the parameter R is essential for deciding the compression/speed trade-off: a small R gives worse compression but shorter (random) access times.

•  W essentially affects compression time only.

Page 64

Implementation

•  Random access to successor lists is implemented lazily through a cascade of iterators.

•  Each series of intervals and each reference causes the creation of an iterator; the same happens for residuals.

•  The results of all iterators are then merged.

•  The advantage of laziness is that we never have to build an actual list of successors in memory, so the overhead is limited to the number of actual reads, not to the number of successor lists that would be necessary to re-create a given one.

Page 65

Access speed

•  Access speed to a compressed graph is commonly measured as the time required to access a link (≈ 300 ns for WebGraph).

•  This quantity, however, is strongly dependent on the architecture (e.g., cache size) and, even more, on low-level optimizations (e.g., hard-coding of the first codewords of an instantaneous code).

•  To compare speeds reliably, we need public data that anyone can access, and a common framework for the low-level operations.

•  A first step is http://webgraph-data.dsi.unimi.it/: freely available data to compare compression techniques.

Page 66

Summary

WebGraph provides methods to manage very large (web) graphs. It consists of:

•  ζ codes: particularly suitable for storing web graphs (or integers with a power-law distribution in a certain exponential range)

•  algorithms for compressing web graphs that exploit referentiation, intervalization and ζ codes to provide a high compression ratio

•  algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay decompression until it is actually necessary

Page 67

Conclusions

WebGraph combines new codes, new insights on the structure of the Web graph and new algorithmic techniques to achieve a very high compression ratio, while still retaining good access speed (can it be better?).

The software is highly tunable: one can experiment with dozens of codes, algorithmic techniques and compression parameters, and there is a large unexplored space of combinations.

A theoretically interesting question is how to optimally combine differential compression and intervalization: can you do better than the current greedy approach (first copy as much as you can, then intervalize)?

Page 68

Conclusions

The compression techniques are specialized for Web graphs.

The average link size decreases as the graph grows.

The average link access time increases as the graph grows.

ζ codes seem to achieve a great trade-off between average bit size and access time.

Page 69

References 1/2

1.  [AJB00] Albert, Jeong, and Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000.
2.  [Backstrom+12] Backstrom, Boldi, Rosa, Ugander, and Vigna. 2012. Four degrees of separation. In Proceedings of the 3rd Annual ACM Web Science Conference (WebSci '12)
3.  [BBK03] Blandford, Blelloch, and Kash. 2003. Compact representations of separable graphs. In Proceedings of the 14th ACM–SIAM Symposium on Discrete Algorithms (SODA '03)
4.  [Bharat+98] Bharat, Broder, Henzinger, Kumar, and Venkatasubramanian. 1998. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of the Seventh International Conference on World Wide Web (WWW7)
5.  [Bra01] Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
6.  [BRV11a] Boldi, Rosa, and Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW '11)
7.  [BRV11b] Boldi, Rosa, and Vigna. 2011. Robustness of social networks: comparative results based on distance distributions. In Proceedings of the Third International Conference on Social Informatics (SocInfo '11)
8.  [BV04] Boldi and Vigna. 2004. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW '04)
9.  [Bor05] Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005.
10.  [Cho+06] Cho, Garcia-Molina, Haveliwala, Lam, Paepcke, Raghavan, and Wesley. 2006. Stanford WebBase components and applications. ACM Trans. Internet Technol. 6

Page 70

References 2/2

11.  [Crescenzi+12] Crescenzi, Grossi, Habib, Lanzi, and Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 2012.
12.  [Flajolet+07] Flajolet, Fusy, Gandouet, and Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AofA '07)
13.  [Firmani+12] Firmani, Italiano, Laura, Orlandi, and Santaroni. 2012. Computing strong articulation points and strong bridges in large scale graphs. In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA '12)
14.  [Italiano+12] Italiano, Laura, and Santaroni. Finding strong articulation points and strong bridges in linear time. Theoretical Computer Science, vol. 447, 74–84, 2012.
15.  [KLPM10] Kwak, Lee, Park, and Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
16.  [LH08] Leskovec and Horvitz. 2008. Planetary-scale views on a large instant-messaging network. In Proceedings of the 17th International Conference on World Wide Web (WWW '08)
17.  [Mislove+07] Mislove, Marcon, Gummadi, Druschel, and Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC '07)
18.  [PGF02] Palmer, Gibbons, and Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02)
19.  [Randall+02] Randall, Stata, Wiener, and Wickremesinghe. 2002. The Link Database: fast access to graphs of the Web. In Proceedings of the Data Compression Conference (DCC '02)
20.  [RAK07] Raghavan, Albert, and Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 76(3), 2007.