machine learning in the cloud with graphlab

Danny Bickson Machine Learning in the Cloud with GraphLab Applied machine learning day, January 20, 2014 MS


Post on 11-Nov-2014


DESCRIPTION

Talk by Dr. Danny Bickson, GraphLab Inc. at the Applied Machine Learning Day, January 20, 2014 @ MS

TRANSCRIPT

Page 1: Machine Learning in the Cloud with GraphLab

Danny Bickson

Machine Learning in the Cloud with GraphLab

Applied machine learning day, January 20, 2014 MS

Page 2: Machine Learning in the Cloud with GraphLab

Needless to Say, We Need Machine Learning for Big Data

72 Hours a Minute: YouTube
28 Million Wikipedia Pages
1 Billion Facebook Users
6 Billion Flickr Photos

“… data a new class of economic asset, like currency or gold.”

Page 3: Machine Learning in the Cloud with GraphLab

How will we design and implement parallel learning systems?

Big Learning

Page 4: Machine Learning in the Cloud with GraphLab

A Shift Towards Parallelism

GPUs, Multicore, Clusters, Clouds, Supercomputers

• ML experts repeatedly solve the same parallel design challenges:
  • Race conditions, distributed state, communication…
• The resulting code is:
  • difficult to maintain, extend, debug…

Graduate students

Avoid these problems by using high-level abstractions

Page 5: Machine Learning in the Cloud with GraphLab

MapReduce for Data-Parallel ML

• Excellent for large data-parallel tasks!

Data-Parallel (MapReduce):
• Cross Validation
• Feature Extraction
• Computing Sufficient Statistics

Graph-Parallel:
• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
• Semi-Supervised Learning: Label Propagation, CoEM
• Graph Analysis: PageRank, Triangle Counting
• Collaborative Filtering: Tensor Factorization

Is there more to Machine Learning?
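As a minimal sketch of the data-parallel column, computing sufficient statistics fits the map/reduce pattern directly: each shard is summarized independently and the summaries are added. The function names and the toy shards below are illustrative assumptions, not part of the talk.

```python
from functools import reduce


def map_shard(shard):
    """Map step: summarize one shard as (count, sum, sum of squares)."""
    return (len(shard), sum(shard), sum(x * x for x in shard))


def combine(a, b):
    """Reduce step: sufficient statistics add component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(shards):
    """Population mean and variance from per-shard summaries only."""
    n, s, ss = reduce(combine, map(map_shard, shards), (0, 0.0, 0.0))
    mean = s / n
    return mean, ss / n - mean * mean


# Hypothetical data split across three workers.
shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, var = mean_and_variance(shards)
```

No shard needs to see another shard's data, which is why this kind of task parallelizes well without a graph abstraction.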

Page 6: Machine Learning in the Cloud with GraphLab

Carnegie Mellon University

The Power of Dependencies

where the value is!

Page 7: Machine Learning in the Cloud with GraphLab

Label a Face and Propagate

Page 8: Machine Learning in the Cloud with GraphLab

Pairwise similarity not enough…

Not similar enough to be sure

Page 9: Machine Learning in the Cloud with GraphLab

Propagate Similarities & Co-occurrences for Accurate Predictions

similarity edges

co-occurring faces

further evidence
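The propagation idea on this slide can be sketched as a simple label propagation loop: labeled faces keep their label, and every other face repeatedly takes the similarity-weighted average of its neighbors' label scores. This is a generic sketch under assumed inputs (the face names, weights, and function name are hypothetical), not the algorithm used in the talk's demo.

```python
def propagate_labels(edges, seeds, num_iters=20):
    """edges: {(u, v): weight}, undirected, positive weights.
    seeds: {node: {label: score}}, held fixed during propagation."""
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))

    beliefs = {n: dict(seeds.get(n, {})) for n in neighbors}
    for _ in range(num_iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = dict(seeds[node])
                continue
            scores, total = {}, 0.0
            for nbr, w in nbrs:
                total += w
                for label, score in beliefs[nbr].items():
                    scores[label] = scores.get(label, 0.0) + w * score
            updated[node] = {l: s / total for l, s in scores.items()}
        beliefs = updated
    # Pick the highest-scoring label per node (None if nothing reached it).
    return {n: (max(b, key=b.get) if b else None) for n, b in beliefs.items()}


# Hypothetical faces: "a" and "d" are labeled; "c" is only weakly similar to "d".
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.1}
seeds = {"a": {"alice": 1.0}, "d": {"bob": 1.0}}
labels = propagate_labels(edges, seeds)
```

Face "c" has no direct edge to a labeled "alice" face, yet it ends up labeled through the chain of strong similarity edges, which is the point of exploiting dependencies.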

Page 10: Machine Learning in the Cloud with GraphLab

Collaborative Filtering: Independent Case

Lord of the Rings

Star Wars IV

Star Wars I

Harry Potter

Pirates of the Caribbean

Page 11: Machine Learning in the Cloud with GraphLab

Collaborative Filtering: Exploiting Dependencies

City of God

Wild Strawberries

The Celebration

La Dolce Vita

Women on the Verge of a Nervous Breakdown

What do I recommend???

Page 12: Machine Learning in the Cloud with GraphLab

Machine Learning Pipeline

Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data

images → faces → similar faces → belief propagation → face labels
docs → important words → shared words → LDA → doc topics
movie ratings → side info → rated movies → collaborative filtering → movie recommend.

Page 13: Machine Learning in the Cloud with GraphLab

Parallelizing Machine Learning

Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data

Graph Ingress: mostly data-parallel

Graph-Structured Computation: graph-parallel

Page 14: Machine Learning in the Cloud with GraphLab

ML Tasks Beyond Data-Parallelism

Data-Parallel (Map Reduce):
• Cross Validation
• Feature Extraction
• Computing Sufficient Statistics

Graph-Parallel:
• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
• Semi-Supervised Learning: Label Propagation, CoEM
• Graph Analysis: PageRank, Triangle Counting
• Collaborative Filtering: Tensor Factorization

Page 15: Machine Learning in the Cloud with GraphLab


Example of a Graph-Parallel Algorithm

Page 16: Machine Learning in the Cloud with GraphLab

PageRank

What’s the rank of this user?

Rank?

Depends on rank of who follows her

Depends on rank of who follows them…

Loops in graph → Must iterate!

Page 17: Machine Learning in the Cloud with GraphLab

PageRank Iteration

Iterate until convergence:

$$R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} w_{ji} R[j]$$

• α is the random reset probability
• w_ji is the prob. of transitioning (similarity) from j to i

“My rank is weighted average of my friends’ ranks”
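A minimal sketch of this iteration, assuming the graph is given as a dictionary of incoming edges with weights w_ji that sum to 1 over each source's outgoing edges. The function name, the toy graph, and the tolerance are illustrative assumptions.

```python
def pagerank(in_edges, alpha=0.15, tol=1e-9, max_iters=1000):
    """in_edges: {i: [(j, w_ji), ...]} with every vertex present as a key.
    Repeats R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] until ranks settle."""
    ranks = {i: 1.0 for i in in_edges}
    for _ in range(max_iters):
        new_ranks = {
            i: alpha + (1 - alpha) * sum(w * ranks[j] for j, w in sources)
            for i, sources in in_edges.items()
        }
        delta = max(abs(new_ranks[i] - ranks[i]) for i in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Hypothetical 3-user graph: a -> b (0.5), a -> c (0.5), b -> c (1.0), c -> a (1.0).
in_edges = {
    "a": [("c", 1.0)],
    "b": [("a", 0.5)],
    "c": [("a", 0.5), ("b", 1.0)],
}
ranks = pagerank(in_edges)
```

Because the graph contains a loop (a → c → a), no single pass gives the answer; the ranks only stabilize after repeated sweeps.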

Page 18: Machine Learning in the Cloud with GraphLab

Properties of Graph Parallel Algorithms

Dependency Graph

Iterative Computation

My Rank

Friends Rank

Local Updates

Page 19: Machine Learning in the Cloud with GraphLab

Addressing Graph-Parallel ML

Data-Parallel (Map Reduce):
• Cross Validation
• Feature Extraction
• Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Graph-Parallel Abstraction):
• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
• Semi-Supervised Learning: Label Propagation, CoEM
• Data-Mining: PageRank, Triangle Counting
• Collaborative Filtering: Tensor Factorization

Page 20: Machine Learning in the Cloud with GraphLab

Carnegie Mellon University

Page 21: Machine Learning in the Cloud with GraphLab

Data Graph

Data associated with vertices and edges

Vertex Data:
• User profile text
• Current interests estimates

Edge Data:
• Similarity weights

Graph:
• Social Network
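The data graph above can be pictured as a container that stores arbitrary user data on vertices and edges. The classes below are a hypothetical illustration of that idea in plain Python, not GraphLab's actual graph type.

```python
from dataclasses import dataclass, field


@dataclass
class VertexData:
    """Per-user data: profile text plus current interest estimates."""
    profile_text: str
    interests: dict = field(default_factory=dict)


@dataclass
class DataGraph:
    """Social network with data attached to vertices and edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> VertexData
    edges: dict = field(default_factory=dict)     # (src, dst) -> similarity weight

    def add_vertex(self, vid, profile_text):
        self.vertices[vid] = VertexData(profile_text)

    def add_edge(self, src, dst, similarity):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added first")
        self.edges[(src, dst)] = similarity

    def neighbors(self, vid):
        """Outgoing neighbors of vid with their edge weights."""
        return [(d, w) for (s, d), w in self.edges.items() if s == vid]


graph = DataGraph()
graph.add_vertex("alice", "likes hiking and jazz")
graph.add_vertex("bob", "jazz drummer")
graph.add_edge("alice", "bob", 0.8)
graph.vertices["alice"].interests["jazz"] = 0.9
```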

Page 22: Machine Learning in the Cloud with GraphLab


How do we program graph computation?

“Think like a Vertex.” - Malewicz et al. [SIGMOD’10]

Page 23: Machine Learning in the Cloud with GraphLab

Update Functions

User-defined program: applied to a vertex, transforms data in the scope of the vertex

pagerank(i, scope) {
    // Get Neighborhood data
    (R[i], w_ji, R[j]) ← scope;

    // Update the vertex data
    R[i] ← α + (1 − α) Σ_{j ∈ N[i]} w_ji × R[j];

    // Reschedule Neighbors if needed
    if R[i] changes then
        reschedule_neighbors_of(i);
}

Dynamic computation

Update function applied (asynchronously) in parallel until convergence

Many schedulers available to prioritize computation
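The update-function pseudocode can be sketched in runnable form with a simple FIFO scheduler. This is a sequential, single-process illustration of dynamic scheduling; the queue discipline, the tolerance, and all names are assumptions, and it does not model GraphLab's parallel or asynchronous execution.

```python
from collections import deque


def run_dynamic_pagerank(in_edges, alpha=0.15, tol=1e-6):
    """in_edges: {i: [(j, w_ji), ...]} with every vertex present as a key.
    A vertex is re-run only when one of its in-neighbors changed."""
    out_neighbors = {v: [] for v in in_edges}
    for dst, sources in in_edges.items():
        for src, _ in sources:
            out_neighbors[src].append(dst)

    ranks = {v: 1.0 for v in in_edges}
    queue = deque(in_edges)       # hypothetical FIFO scheduler
    scheduled = set(in_edges)
    updates = 0
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        # Update the vertex data from its scope (in-neighbors and edge weights).
        new_rank = alpha + (1 - alpha) * sum(w * ranks[j] for j, w in in_edges[i])
        changed = abs(new_rank - ranks[i]) > tol
        ranks[i] = new_rank
        updates += 1
        # Reschedule the vertices that depend on i only if R[i] changed.
        if changed:
            for k in out_neighbors[i]:
                if k not in scheduled:
                    scheduled.add(k)
                    queue.append(k)
    return ranks, updates


in_edges = {
    "a": [("c", 1.0)],
    "b": [("a", 0.5)],
    "c": [("a", 0.5), ("b", 1.0)],
}
ranks, updates = run_dynamic_pagerank(in_edges)
```

Swapping the deque for a priority queue keyed on the size of the last change would be one way to prioritize computation, in the spirit of the schedulers mentioned on the slide.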

Page 24: Machine Learning in the Cloud with GraphLab

The GraphLab Framework

• Graph Based Data Representation
• Update Functions: User Computation
• Scheduler
• Consistency Model

Page 25: Machine Learning in the Cloud with GraphLab

• Bayesian Tensor Factorization
• Gibbs Sampling
• Dynamic Block Gibbs Sampling
• Matrix Factorization
• Lasso
• SVM
• Belief Propagation
• PageRank
• CoEM
• K-Means
• SVD
• LDA
• Linear Solvers
• Splash Sampler
• Alternating Least Squares
• …Many others…

Page 26: Machine Learning in the Cloud with GraphLab

Never Ending Learner Project (CoEM)

Hadoop: 95 Cores, 7.5 hrs
Distributed GraphLab: 32 EC2 machines, 80 secs

0.3% of Hadoop time

2 orders of mag faster → 2 orders of mag cheaper
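The 0.3% figure follows from the two runtimes on the slide; a short check of the arithmetic, using only the numbers quoted above:

```latex
\frac{80\ \text{s}}{7.5\ \text{h} \times 3600\ \text{s/h}}
  = \frac{80}{27000} \approx 0.003 = 0.3\%,
\qquad
\frac{27000}{80} \approx 338 \approx 10^{2.5}
```

So the runtime ratio is a little over two orders of magnitude.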

Page 27: Machine Learning in the Cloud with GraphLab


Thus far…

GraphLab 1 provided exciting scaling performance

But…

We couldn’t scale up to Altavista Webgraph 2002: 1.4B vertices, 6.7B edges

Page 28: Machine Learning in the Cloud with GraphLab


Natural Graphs

[Image from WikiCommons]

Page 29: Machine Learning in the Cloud with GraphLab


Problem: Existing distributed graph computation systems perform poorly on Natural Graphs

Page 30: Machine Learning in the Cloud with GraphLab

Achilles Heel: Idealized Graph Assumption

Assumed…
Small degree → Easy to partition

But, Natural Graphs…
Many high degree vertices (power-law degree distribution) → Very hard to partition

Page 31: Machine Learning in the Cloud with GraphLab

Power-Law Degree Distribution

[Figure: log-log plot of Number of Vertices (count) versus Degree for the AltaVista WebGraph, 1.4B Vertices, 6.6B Edges]

High-Degree Vertices: 1% of vertices adjacent to 50% of edges
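The "1% of vertices adjacent to 50% of edges" statistic can be measured on any edge list. The sketch below is an assumed, illustrative way to compute it; the synthetic hub-and-ring graph is made up to reproduce the same proportions and is not the AltaVista data.

```python
def top_vertex_edge_share(edges, top_fraction=0.01):
    """Fraction of edges touching at least one of the highest-degree vertices."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    k = max(1, int(len(degree) * top_fraction))
    top = set(sorted(degree, key=degree.get, reverse=True)[:k])
    touching = sum(1 for u, v in edges if u in top or v in top)
    return touching / len(edges)


# Synthetic graph: one hub (vertex 0) linked to 100 leaves, plus a ring over the leaves.
hub_edges = [(0, i) for i in range(1, 101)]
ring_edges = [(i, i % 100 + 1) for i in range(1, 101)]
share = top_vertex_edge_share(hub_edges + ring_edges)
```

Here the top 1% of vertices is the single hub, and it touches half of all edges.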

Page 32: Machine Learning in the Cloud with GraphLab

High Degree Vertices are Common

Netflix (Users, Movies): Popular Movies

“Social” People: Obama

LDA (Docs, Words): Common Words, Hyper Parameters

[Figure: LDA graphical model with hyperparameters α and B, per-document θ, and topic/word pairs Z, w]

Page 33: Machine Learning in the Cloud with GraphLab

Power-Law Graphs are Difficult to Partition

• Power-Law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]

• Traditional graph-partitioning algorithms perform poorly on Power-Law Graphs. [Abou-Rjeili et al. 06]

[Figure: graph split across CPU 1 and CPU 2]

Page 34: Machine Learning in the Cloud with GraphLab

GraphLab 2 Solution

• Split High-Degree vertices
• New Abstraction → Leads to this Split Vertex Strategy

Program For This

Run on This

[Figure: a high-degree vertex split across Machine 1 and Machine 2]
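A rough sketch of the split-vertex idea: edges, rather than vertices, are assigned to machines, so a high-degree vertex ends up with a replica on every machine that holds one of its edges. The round-robin placement rule below is a stand-in chosen for determinism; it is an assumption and not the placement strategy GraphLab 2 actually uses.

```python
def vertex_cut(edges, num_machines):
    """Assign each edge to one machine; a vertex is replicated wherever its edges land."""
    placement = {}  # edge -> machine index
    replicas = {}   # vertex -> set of machine indices holding a copy
    for idx, (u, v) in enumerate(edges):
        machine = idx % num_machines  # hypothetical round-robin placement
        placement[(u, v)] = machine
        replicas.setdefault(u, set()).add(machine)
        replicas.setdefault(v, set()).add(machine)
    return placement, replicas


# A hub with six neighbors, spread over two machines.
edges = [("hub", f"leaf{i}") for i in range(6)]
placement, replicas = vertex_cut(edges, num_machines=2)
edges_per_machine = [list(placement.values()).count(m) for m in range(2)]
```

The hub's edges are balanced three per machine, at the cost of keeping two copies of the hub in sync; the low-degree leaves are not replicated at all.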

Page 35: Machine Learning in the Cloud with GraphLab

GAS Decomposition

Gather (Reduce): Accumulate information about neighborhood
    Parallel “Sum”: Y + … + Y → Σ

Apply: Apply the accumulated value to center vertex
    Y, Σ → Y’

Scatter: Update adjacent edges and vertices.
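To make the three phases concrete, here is a small sketch of PageRank written as a gather/apply/scatter program with a toy synchronous engine. The class and method names are hypothetical and chosen for illustration; they are not the GraphLab 2 API.

```python
from functools import reduce


class PageRankProgram:
    def __init__(self, alpha=0.15, tol=1e-6):
        self.alpha = alpha
        self.tol = tol

    def gather(self, neighbor_rank, weight):
        """Gather: contribution of one in-neighbor."""
        return weight * neighbor_rank

    def combine(self, left, right):
        """Commutative, associative 'sum' so gathers can run in parallel."""
        return left + right

    def apply(self, accumulated):
        """Apply: new value of the center vertex."""
        return self.alpha + (1 - self.alpha) * accumulated

    def scatter(self, old_rank, new_rank):
        """Scatter: decide whether out-neighbors must be re-activated."""
        return abs(new_rank - old_rank) > self.tol


def run_gas(program, in_edges, max_supersteps=1000):
    """in_edges: {i: [(j, w_ji), ...]} with every vertex present as a key."""
    out_neighbors = {v: [] for v in in_edges}
    for dst, sources in in_edges.items():
        for src, _ in sources:
            out_neighbors[src].append(dst)
    ranks = {v: 1.0 for v in in_edges}
    active = set(in_edges)
    for _ in range(max_supersteps):
        if not active:
            break
        new_ranks, next_active = dict(ranks), set()
        for v in active:
            partials = (program.gather(ranks[s], w) for s, w in in_edges[v])
            new_ranks[v] = program.apply(reduce(program.combine, partials, 0.0))
            if program.scatter(ranks[v], new_ranks[v]):
                next_active.update(out_neighbors[v])
        ranks, active = new_ranks, next_active
    return ranks


in_edges = {"a": [("c", 1.0)], "b": [("a", 0.5)], "c": [("a", 0.5), ("b", 1.0)]}
ranks = run_gas(PageRankProgram(), in_edges)
```

Because the gather results are merged with a commutative, associative sum, the partial sums for a split vertex could be computed on different machines and merged before apply runs.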

Page 36: Machine Learning in the Cloud with GraphLab

GraphChi: Going small with GraphLab

Solve huge problems on small or embedded devices?

Key: Exploit non-volatile memory (starting with SSDs and HDs)

[Figure panels: Before, After]

Page 37: Machine Learning in the Cloud with GraphLab

GraphChi – disk-based GraphLab

Challenge: Random Accesses

Novel GraphChi solution: Parallel sliding windows method → minimizes number of random accesses

Page 38: Machine Learning in the Cloud with GraphLab

GraphChi – disk-based GraphLab

Novel Parallel Sliding Windows algorithm

• Fast!
• Solves tasks as large as current distributed systems
• Minimizes non-sequential disk accesses
• Efficient on both SSD and hard-drive
• Parallel, asynchronous execution
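A simplified, in-memory sketch of the sharding idea behind parallel sliding windows: vertices are split into intervals, shard p holds the edges whose destination lies in interval p, and each shard is sorted by source. The out-edges of an interval then form one contiguous block (a window) in every shard, so they can be read sequentially. The function names and the toy graph are assumptions, and real disk I/O is not modeled here.

```python
def build_shards(edges, intervals):
    """Shard p holds edges whose destination is in intervals[p], sorted by source."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for p, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[p].append((src, dst))
                break
    for shard in shards:
        shard.sort()
    return shards


def window(shard, interval):
    """Bounds [start, end) of the contiguous block whose sources lie in the interval."""
    lo, hi = interval
    start = next((k for k, (s, _) in enumerate(shard) if s >= lo), len(shard))
    end = next((k for k, (s, _) in enumerate(shard) if s > hi), len(shard))
    return start, end


def load_subgraph(shards, intervals, p):
    """In-edges come from shard p in full; out-edges are one window per shard."""
    in_edges = list(shards[p])
    out_edges = []
    for shard in shards:
        start, end = window(shard, intervals[p])
        out_edges.extend(shard[start:end])
    return in_edges, out_edges


edges = [(0, 3), (1, 0), (2, 4), (3, 1), (4, 5), (5, 2), (0, 1)]
intervals = [(0, 2), (3, 5)]  # inclusive vertex-id ranges
shards = build_shards(edges, intervals)
in_edges, out_edges = load_subgraph(shards, intervals, 0)
```

With P intervals, loading one interval's subgraph touches P contiguous blocks instead of one random location per edge.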

Page 39: Machine Learning in the Cloud with GraphLab

Sample Results

Triangle Counting: Twitter graph (1.5B edges)
[Figure: bar chart of runtime in minutes, axis 0 to 500, comparing Hadoop - 1600 nodes [1] and GraphChi - 1 Mac Mini]

Belief Propagation: Altavista Graph (6.7B edges)
[Figure: bar chart of runtime in minutes, axis 0 to 30, comparing Hadoop - 100 machines [2] and GraphChi - 1 Mac Mini]

[1] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. WWW 2011.

[2] U. Kang, D. H. Chau, and C. Faloutsos. Inference of Beliefs on Billion-Scale Graphs. KDD-LDMTA’10, pages 1–7, June 2010.

Page 40: Machine Learning in the Cloud with GraphLab

Triangle Counting on Twitter Graph
40M Users, 1.2B Edges

Total: 34.8 Billion Triangles

Hadoop: 1636 Machines, 423 Minutes
GraphLab2: 64 Machines, 1024 Cores, 1.5 Minutes
GraphChi: 1 Mac Mini, 59 Minutes!

Hadoop results from [Suri & Vassilvitskii WWW ‘11]
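For reference, the quantity being benchmarked can be computed on a small graph by intersecting neighbor sets along each edge. This is a plain single-machine sketch with assumed names; it says nothing about how GraphLab2 or GraphChi implement the computation at scale.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
        unique_edges.add((min(u, v), max(u, v)))
    # Each triangle is found once from each of its three edges.
    total = sum(len(neighbors[u] & neighbors[v]) for u, v in unique_edges)
    return total // 3


# Complete graph on four vertices.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
triangles = count_triangles(k4)
```

The per-edge intersection is a local operation on a vertex's neighborhood, which is why triangle counting sits in the graph-parallel column.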

Page 41: Machine Learning in the Cloud with GraphLab

Carnegie Mellon University

Efficient Multicore Collaborative Filtering

LeBuSiShu team – 5th place in track1, ACM KDD CUP 2011

Yao Wu, Qiang Yan, Danny Bickson, Yucheng Low, Qing Yang

Institute of Automation, Chinese Academy of Sciences

Machine Learning Dept, Carnegie Mellon University

ACM KDD CUP Workshop 2011

Page 42: Machine Learning in the Cloud with GraphLab

Netflix Collaborative Filtering

• Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Figure: log-scale plot of Runtime (s), 10^1 to 10^4, versus #Nodes, 4 to 64, for Hadoop, MPI, and GraphLab]
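A minimal sketch of alternating least squares, restricted to a single latent factor so that each least-squares solve is a one-line formula. The function name, the toy ratings, and the rank-1 restriction are assumptions made for brevity; the benchmark on this slide uses a full matrix factorization.

```python
def als_rank1(ratings, num_iters=50, lam=0.0):
    """ratings: {(user, item): value}. Fits rating ~ user_f[user] * item_f[item].
    Assumes the factors stay nonzero (true here for positive ratings) or lam > 0."""
    by_user, by_item = {}, {}
    for (user, item), r in ratings.items():
        by_user.setdefault(user, []).append((item, r))
        by_item.setdefault(item, []).append((user, r))

    user_f = {u: 1.0 for u in by_user}
    item_f = {i: 1.0 for i in by_item}
    for _ in range(num_iters):
        # Fix item factors; each user solves a 1-D regularized least squares.
        for u, rated in by_user.items():
            num = sum(r * item_f[i] for i, r in rated)
            den = lam + sum(item_f[i] ** 2 for i, _ in rated)
            user_f[u] = num / den
        # Fix user factors; each item does the same.
        for i, raters in by_item.items():
            num = sum(r * user_f[u] for u, r in raters)
            den = lam + sum(user_f[u] ** 2 for u, _ in raters)
            item_f[i] = num / den
    return user_f, item_f


# Toy ratings generated by user strengths (1, 2) and item strengths (1, 2, 3);
# the rating for ("u2", "i3") is held out.
ratings = {
    ("u1", "i1"): 1.0, ("u1", "i2"): 2.0, ("u1", "i3"): 3.0,
    ("u2", "i1"): 2.0, ("u2", "i2"): 4.0,
}
user_f, item_f = als_rank1(ratings)
prediction = user_f["u2"] * item_f["i3"]
```

Each user update reads only the factors of the items that user rated, and vice versa, so the computation follows the edges of the bipartite user-item graph.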

Page 43: Machine Learning in the Cloud with GraphLab

Intel Labs Report on GraphLab

Data source: Nezih Yigitbasi, Intel Labs

Page 44: Machine Learning in the Cloud with GraphLab

ACM KDD CUP 2012

Page 45: Machine Learning in the Cloud with GraphLab

GraphLab team @ WSDM 13

Page 46: Machine Learning in the Cloud with GraphLab

Future Plans

Page 47: Machine Learning in the Cloud with GraphLab

Future Plans

Learn: GraphLab Notebook

Prototype: pip install graphlab → local prototyping

Production: Same code scales - execute on EC2 cluster

Page 48: Machine Learning in the Cloud with GraphLab

GraphLab Internship Plan

Page 49: Machine Learning in the Cloud with GraphLab

GraphLab Conferences

2012 → 2013