TRANSCRIPT
Carnegie Mellon
Joseph Gonzalez
Joint work with:
Yucheng Low
Aapo Kyrola
Danny Bickson
Carlos Guestrin
Guy Blelloch
Joe Hellerstein
David O’Hallaron
Alex Smola
A New Parallel Framework for Machine Learning
[Figure: motivating examples of reasoning over dependencies:
• “Is the driver hostile?” inferred from linked facts: where the car originates from, where the driver lives (vertices A, B, C, D)
• A diagnosis chain: patient presents abdominal pain. Diagnosis? The patient ate food which contains an ingredient purchased from a supplier that also sold to others; diagnosed with E. coli infection
• Related shopper interests: Shopper 1 (Cameras), Shopper 2 (Cooking)]
The Hollywood Fiction…
Mr. Finch develops software which:
• Runs in a “consolidated” data center with access to all government data
• Processes multi-modal data: video surveillance, federal and local databases, social networks, …
• Uses advanced machine learning to identify connected patterns and predict catastrophic events
…how far is this from reality?
Big Data is a reality
48 hours of YouTube video uploaded every minute
24 million Wikipedia pages
750 million Facebook users
6 billion Flickr photos
Machine learning is a reality
Raw Data → Machine Learning → Understanding
Simple models, e.g. linear regression
[Figure: scatter plot of raw data with a fitted regression line]
Limited to simplistic models: fail to fully utilize the data
Substantial system-building effort: systems evolve slowly and are costly
We have mastered:
Big Data + Large-Scale Compute Clusters + Simple Machine Learning
[Figure: scatter plot with a fitted line]
Advanced Machine Learning
Raw Data → Machine Learning → Understanding
[Figure: advanced structured models, Markov random fields and deep belief / neural networks, linking entities (Mubarak, Obama, Netanyahu, Abbas) through relations such as Needs, Supports, Cooperate, Distrusts, and shopper interests such as Cameras and Cooking]
Data dependencies substantially complicate parallelization
Challenges of Learning at Scale
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds
New challenges for designing machine learning algorithms: race conditions and deadlocks; managing distributed model state; data locality and efficient inter-process coordination
New challenges for implementing machine learning algorithms: parallel debugging and profiling; fault tolerance
Rich, structured machine learning techniques capable of fully modeling the data dependencies
Goal: rapid system development; quickly adapt to new data, priors, and objectives; scale with new hardware and system advances
The goal of the GraphLab project…
Big Data + Large-Scale Compute Clusters + Advanced Machine Learning
Outline
Importance of Large-Scale Machine Learning: need to model data dependencies
Existing Large-Scale Machine Learning Abstractions: need for an efficient graph-structured abstraction
GraphLab Abstraction: addresses data dependencies; enables the expression of efficient algorithms
Experimental Results: GraphLab dramatically outperforms existing abstractions
Open Research Challenges
How will we design and implement parallel learning systems?
We could use…
Threads, Locks, & Messages: “low level parallel primitives”
Threads, Locks, and Messages
ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system; tune for a specific parallel platform; six months later the conference paper contains:
“We implemented ______ in parallel.”
The resulting code: is difficult to maintain; is difficult to extend; couples the learning model to the parallel implementation
Graduate students
…a better answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions
MapReduce – Map Phase
[Figure: CPUs 1–4 each process an independent data row (an image), emitting an image-feature value]
Embarrassingly parallel, independent computation; no communication needed
MapReduce – Reduce Phase
[Figure: CPUs 1–2 fold the per-image feature values into aggregate statistics]
Image Features → Attractive Face Statistics / Ugly Face Statistics
(each image labeled Attractive or Ugly; statistics aggregated per label)
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel: Map Reduce (Feature Extraction, Algorithm Tuning, Basic Data Processing)
Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso … ?
Is there more to Machine Learning?
Concrete Example: Label Propagation
Label Propagation Algorithm
Social Arithmetic:
Likes(Me) = 50% × (what I list on my profile) + 40% × Likes(Sue Ann) + 10% × Likes(Carlos)
My profile: 50% Cameras, 50% Biking
Sue Ann: 80% Cameras, 20% Biking
Carlos: 30% Cameras, 70% Biking
⇒ I Like: 60% Cameras, 40% Biking
Recurrence Algorithm: Likes[i] = Σ_j W_ij × Likes[j]; iterate until convergence
Parallelism: compute all Likes[i] in parallel
Properties of Graph-Parallel Algorithms
Dependency Graph: what I like depends on what my friends like
Iterative Computation
Factored Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel: Map Reduce (Feature Extraction, Algorithm Tuning, Basic Data Processing)
Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso … Map Reduce?
Why not use Map-Reduce for Graph-Parallel Algorithms?
Data Dependencies
Map-Reduce does not efficiently express data dependencies:
the user must code substantial data transformations; costly data replication
[Figure: independent data rows assigned to processors; one slow processor]
Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
[Figure: each iteration maps data partitions across CPUs 1–3, with a barrier between iterations]
MapAbuse: Iterative MapReduce
Only a subset of data needs computation:
[Figure: iterations over data partitions on CPUs 1–3, separated by barriers]
MapAbuse: Iterative MapReduce
System is not optimized for iteration:
[Figure: each iteration incurs a startup penalty and a disk penalty between barriers]
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel: Map Reduce (Cross Validation, Feature Extraction, Computing Sufficient Statistics)
Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso … Map Reduce? Bulk Synchronous?
Bulk Synchronous Parallel (BSP)
Implementations: Pregel, Giraph, …
Compute → Barrier → Communicate
Bulk synchronous computation can be highly inefficient.
Problem with Bulk Synchronous
Example algorithm: if a neighbor is Red, then turn Red
Bulk synchronous computation: evaluate the condition on all vertices in every phase
4 phases × 9 computations = 36 computations
Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes
4 phases × 2 computations = 8 computations
[Figure: a wave of Red spreading across the graph from Time 0 to Time 4]
Real-World Example: Loopy Belief Propagation
Loopy Belief Propagation (Loopy BP)
• Iteratively estimate the “beliefs” about vertices: read in messages, update the marginal estimate (belief), send updated out-messages
• Repeat for all variables until convergence
Bulk Synchronous Loopy BP
• Often considered embarrassingly parallel: associate a processor with each vertex; receive all messages; update all beliefs; send all messages
• Proposed by: Brunton et al. CRV’06; Mendiburu et al. GECC’07; Kang et al. LDMTA’10; …
Sequential Computational Structure
Hidden Sequential Structure
[Figure: a chain of variables with evidence at both ends; messages must flow sequentially along the chain]
• Running Time: (time for a single parallel iteration) × (number of iterations)
Optimal Sequential Algorithm
[Figure: running time vs. number of processors p on a chain of length 2n]
Forward-Backward (p = 1): 2n
Bulk Synchronous: 2n²/p for p ≤ 2n (a gap versus optimal)
Optimal Parallel (p = 2): n
The Splash Operation
• Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
Data-Parallel Algorithms can be Inefficient
[Figure: runtime in seconds vs. number of CPUs (1–8): optimized in-memory Bulk Synchronous vs. Asynchronous Splash BP]
Summary of Work Efficiency
The Bulk Synchronous model is not work efficient! It computes “messages” before they are ready; increasing the number of processors increases the overall work; this costs CPU time and energy!
How do we recover work efficiency? Respect the sequential structure of the computation; compute each “message” as needed: asynchronously.
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallelism
Data-Parallel: Map Reduce (Cross Validation, Feature Extraction, Computing Sufficient Statistics)
Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso … Bulk Synchronous?
Outline
Importance of Large-Scale Machine Learning: need to model data dependencies
Existing Large-Scale Machine Learning Abstractions: need for an efficient graph-structured abstraction
GraphLab Abstraction: addresses data dependencies; enables the expression of efficient algorithms
Experimental Results: GraphLab dramatically outperforms existing abstractions
Open Research Challenges
What is GraphLab?
The GraphLab Abstraction
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Graph: social network
Vertex data: user profile text; current interest estimates
Edge data: similarity weights
Implementing the Data Graph
Multicore setting: in memory; relatively straightforward
vertex_data(vid) → data
edge_data(vid, vid) → data
neighbors(vid) → vid_list
Challenge: fast lookup, low overhead
Solution: dense data structures; fixed Vdata & Edata types; immutable graph structure
Cluster setting: in memory; partition the graph with ParMETIS or random cuts; cached ghosting
[Figure: nodes 1 and 2 each hold a partition of vertices A–D plus ghost copies of boundary vertices]
The GraphLab Abstraction
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data (Likes[i], W_ij, Likes[j]) from scope

  // Update the vertex data: Likes[i] = Σ_j W_ij × Likes[j]

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
The GraphLab Abstraction
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPUs 1 and 2 pull vertices (a–k) from a shared scheduler queue; executing an update can schedule further vertices]
The process repeats until the scheduler is empty.
Choosing a Schedule
GraphLab provides several different schedulers:
Round Robin: vertices are updated in a fixed order
FIFO: vertices are updated in the order they are added
Priority: vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm.
Obtain different algorithms by simply changing a flag!
--scheduler=roundrobin
--scheduler=fifo
--scheduler=priority → Optimal Splash BP Algorithm
The GraphLab Abstraction
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
Importance of Consistency
Many algorithms require strict consistency, or perform significantly better under strict consistency (e.g., Alternating Least Squares).
Machine learning algorithms require “model debugging”: Build → Test → Debug → Tweak Model
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution on CPUs 1 and 2 is equivalent to some single-CPU sequential execution over time]
Common Problem: Write-Write Race
Processors running adjacent update functions simultaneously modify shared data:
[Figure: CPU1 and CPU2 both write the shared value; only one write survives in the final value]
Consistency Rules
Guaranteed sequential consistency for all update functions
[Figure: data associated with vertices and edges]
Full Consistency
[Figure: an update’s full scope, its vertex, adjacent edges, and adjacent vertices, is protected from concurrent modification]
Obtaining More Parallelism
Edge Consistency
[Figure: CPU 1 and CPU 2 update nonadjacent vertices concurrently; reads of shared neighbors remain safe]
Consistency Through R/W Locks
Read/write locks:
Full Consistency: write-lock the center vertex and all adjacent vertices
Edge Consistency: write-lock the center vertex; read-lock adjacent vertices
Canonical lock ordering prevents deadlock
The GraphLab Abstraction
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
The Code
API implemented in C++: Pthreads, GCC atomics, TCP/IP, MPI, in-house RPC
Multicore API: Matlab/Java/Python support; available under the Apache 2.0 License
Cloud API: built and tested on EC2; no fault tolerance
http://graphlab.org
Anatomy of a GraphLab Program:
1) Define a C++ update function
2) Build the data graph using the C++ graph object
3) Set engine parameters: scheduler type; consistency model
4) Add initial vertices to the scheduler
5) Run the engine on the graph [blocking C++ call]
6) The final answer is stored in the graph
Bayesian Tensor Factorization, Gibbs Sampling, Dynamic Block Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, K-Means, SVD, LDA, …many others…
Startups using GraphLab; companies experimenting with GraphLab; academic projects exploring GraphLab
1600+ unique downloads tracked (possibly many more from direct repository checkouts)
GraphLab Matrix Factorization Toolkit
Used in ACM KDD Cup 2011 (Track 1): 5th place out of more than 1000 participants; two orders of magnitude faster than Mahout
Testimonials:
“The Graphlab implementation is significantly faster than the Hadoop implementation … [GraphLab] is extremely efficient for networks with millions of nodes and billions of edges …” -- Akshay Bhat, Cornell
“The guys at GraphLab are crazy helpful and supportive … 78% of our value comes from motivation and brilliance of these guys.” -- Timmy Wilson, smarttypes.org
“I have been very impressed by Graphlab and your support/work on it.” -- Clive Cox, rumblelabs.com
Outline
Importance of Large-Scale Machine Learning: need to model data dependencies
Existing Large-Scale Machine Learning Abstractions: need for an efficient graph-structured abstraction
GraphLab Abstraction: addresses data dependencies; enables the expression of efficient algorithms
Experimental Results: GraphLab dramatically outperforms existing abstractions
Open Research Challenges
Shared Memory Experiments
Shared memory setting: 16-core workstation
Loopy Belief Propagation
3D retinal image denoising
Data Graph: 1 million vertices, 3 million edges
Update Function: Loopy BP update equation
Scheduler: approximate priority
Consistency Model: edge consistency
Loopy Belief Propagation
[Figure: speedup vs. number of CPUs (1–16); SplashBP achieves a 15.5× speedup, close to optimal]
CoEM (Rosie Jones, 2005)
Named entity recognition task: is “Dog” an animal? Is “Catalina” a place?
[Figure: bipartite graph linking noun phrases (the dog, Australia, Catalina Island) to contexts (<X> ran quickly, travelled to <X>, <X> is pleasant)]
Vertices: 2 million; Edges: 200 million
Hadoop: 95 cores, 7.5 hrs
CoEM (Rosie Jones, 2005)
[Figure: GraphLab CoEM speedup vs. number of CPUs (1–16)]
GraphLab: 16 cores, 30 min
15× faster with 6× fewer CPUs than Hadoop (95 cores, 7.5 hrs)
Experiments: Amazon EC2, high-performance nodes
Video Cosegmentation
Segments that mean the same thing
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on a 3D grid
Video cosegmentation speedups; prefetching data & locks
Matrix Factorization
Netflix collaborative filtering: alternating least squares matrix factorization
Model: 0.5 million nodes, 99 million edges
[Figure: the Netflix ratings matrix (Users × Movies) factored into rank-d user and movie factors]
Netflix speedup: increasing the size of the matrix factorization
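Alternating least squares fits the graph abstraction because each vertex update is local: with the movie factors held fixed, each user factor solves a small least-squares problem over its adjacent edges (its ratings), and vice versa. A standard formulation (not taken from the slides) is:

```latex
\min_{U,V} \sum_{(u,m)\in\Omega} \left(r_{um} - U_u^{\top} V_m\right)^2
  + \lambda\Big(\sum_u \lVert U_u\rVert^2 + \sum_m \lVert V_m\rVert^2\Big)
```

where $\Omega$ is the set of observed ratings. Fixing $V$, each user vertex $u$ has the closed-form update over its neighborhood $N(u)$:

```latex
U_u \leftarrow \Big(\sum_{m \in N(u)} V_m V_m^{\top} + \lambda I\Big)^{-1}
  \sum_{m \in N(u)} r_{um}\, V_m
```

Each update reads only adjacent edge data (ratings) and neighbor factors, which is exactly the scope an edge-consistent update function protects.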
Distributed GraphLab
The Cost of Hadoop
Outline
Importance of Large-Scale Machine Learning: need to model data dependencies
Existing Large-Scale Machine Learning Abstractions: need for an efficient graph-structured abstraction
GraphLab Abstraction: addresses data dependencies; enables the expression of efficient algorithms
Experimental Results: GraphLab dramatically outperforms existing abstractions
Open Research Challenges
Storage of large data graphs; fault tolerance to machine/network failure:
Can I remove (re-task) a node or network resources without restarting dependent computation?
Relaxed transactional consistency:
Can I eliminate locking and approximately recover when data corruption occurs?
Support rapid vertex and edge addition:
How can I allow graphs to continuously grow while computation proceeds?
Graph partitioning for “natural graphs”:
How can I balance the computation while minimizing communication on a power-law graph?
Event-driven graph computation:
Trigger computation on data and structural modifications; exploit small neighborhood effects
Summary
Importance of Large-Scale Machine Learning: need to model data dependencies
Existing Large-Scale Machine Learning Abstractions: need for an efficient graph-structured abstraction
GraphLab Abstraction: addresses data dependencies; enables the expression of efficient algorithms
Experimental Results: GraphLab dramatically outperforms existing abstractions
Open Research Challenges
Check out GraphLab
http://graphlab.org
Documentation… Code… Tutorials…
Questions & Comments