GraphLab: A New Framework For Parallel Machine Learning
YUCHENG LOW, JOSEPH GONZALEZ, AAPO KYROLA, DANNY BICKSON, CARLOS GUESTRIN, JOSEPH HELLERSTEIN
Presented by Hyun Ryong (Ryan) Lee

TRANSCRIPT

Page 1: GraphLab: A New Framework For Parallel Machine Learning

1

GraphLab: A New Framework For Parallel Machine Learning YUCHENG LOW, JOSEPH GONZALEZ, AAPO KYROLA, DANNY BICKSON, CARLOS GUESTRIN, JOSEPH HELLERSTEIN

Presented by Hyun Ryong (Ryan) Lee

Page 2: GraphLab: A New Framework For Parallel Machine Learning

Parallel Programming is Important for ML
End of frequency scaling -> need parallelism to scale

2

Page 3: GraphLab: A New Framework For Parallel Machine Learning

Existing Frameworks are Unsuitable for ML
High-level framework: MapReduce
◦ Limited scope: targets "embarrassingly parallel" applications
  » Many ML algorithms have data dependencies (belief propagation, gradient descent, ...)

◦ Cannot express iterative algorithms effectively
  » Many ML algorithms are iterative

3

Page 4: GraphLab: A New Framework For Parallel Machine Learning

Existing Frameworks are Unsuitable for ML

4

[Figure 1 from the MapReduce paper: Execution overview. The user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits and write intermediate files to local disks; reduce workers perform remote reads of those files and write the final output files.]

[Excerpt from the MapReduce paper (Dean and Ghemawat, OSDI 2004), shown on the slide:]

Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines. This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data

Page 5: GraphLab: A New Framework For Parallel Machine Learning

Existing Frameworks are Unsuitable for ML
Low-level frameworks: Pthreads, MPI
◦ DAG abstraction: nodes are computations, edges are data flow

◦ Expressive, but hard to program
  » Reason about synchronization
  » Load balancing
  » Deadlock, livelock, ...

5

Page 6: GraphLab: A New Framework For Parallel Machine Learning

GraphLab
A vertex-centric, asynchronous, shared-memory abstraction for graph processing
◦ Updates are vertex-centric
◦ No barrier synchronization
◦ No message passing abstraction

6

Page 7: GraphLab: A New Framework For Parallel Machine Learning

GraphLab Overview

7

[Overview diagram components: Graph-Based Data Representation · Update Functions (User Computation) · Scheduler · Consistency Model]

Page 8: GraphLab: A New Framework For Parallel Machine Learning

Data Model
Data modeled as a graph with arbitrary data associated with each vertex and edge

8
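To make the data model concrete, here is a minimal C++ sketch of a data graph with arbitrary per-vertex and per-edge data. This is an illustrative structure only, not the actual GraphLab API; the VertexData and EdgeData fields are hypothetical placeholders.

```cpp
// Minimal sketch of the data-graph idea in plain C++ (illustrative only, NOT the
// actual GraphLab API). VertexData and EdgeData are hypothetical placeholder types.
#include <cstddef>
#include <vector>

struct VertexData { double belief = 0.0; };   // arbitrary user data on a vertex
struct EdgeData   { double message = 0.0; };  // arbitrary user data on an edge

struct DataGraph {
    struct Edge { std::size_t src, dst; EdgeData data; };

    std::vector<VertexData> vertices;
    std::vector<Edge> edges;
    std::vector<std::vector<std::size_t>> out_edges;  // vertex id -> ids of outgoing edges

    std::size_t add_vertex(const VertexData& v) {
        vertices.push_back(v);
        out_edges.emplace_back();
        return vertices.size() - 1;
    }
    std::size_t add_edge(std::size_t src, std::size_t dst, const EdgeData& e) {
        edges.push_back({src, dst, e});
        out_edges[src].push_back(edges.size() - 1);
        return edges.size() - 1;
    }
};
```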

Page 9: GraphLab: A New Framework For Parallel Machine Learning

Local Updates
Update function: defines local computation on the scope of a vertex

9

[Figure: the scope of a vertex consists of the vertex itself, its adjacent edges, and its neighboring vertices.]
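A hedged sketch of what an update function over a vertex scope might look like in C++. The Scope and Scheduler types and the update signature are assumptions made for illustration, not GraphLab's real signatures, but the shape is the same: the function touches only data inside the scope and may ask for other vertices to be updated later.

```cpp
// Illustrative shape of an update function and the scope it may touch (assumed
// names, not GraphLab's real signatures).
#include <cstddef>
#include <vector>

struct VData { double value = 0.0; };
struct EData { double weight = 0.0; };

struct Scope {
    VData* vertex;                     // data of the vertex being updated
    std::vector<EData*> edges;         // data on the adjacent edges
    std::vector<VData*> neighbors;     // data of the neighboring vertices
};

struct Scheduler {
    std::vector<std::size_t> pending;
    void add_task(std::size_t vertex_id) { pending.push_back(vertex_id); }
};

// Example update: set the vertex value to the mean of its neighbors' values.
void example_update(Scope& scope, Scheduler& /*sched*/) {
    if (scope.neighbors.empty()) return;
    double sum = 0.0;
    for (const VData* n : scope.neighbors) sum += n->value;
    scope.vertex->value = sum / static_cast<double>(scope.neighbors.size());
    // A real update could call sched.add_task(...) to reschedule affected neighbors.
}
```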

Page 10: GraphLab: A New Framework For Parallel Machine Learning

Sync Mechanisms
Much like fold/reduce
◦ Fold: aggregate data sequentially
◦ Merge: parallel tree reduction of folded data
◦ Apply: apply updated data to the global shared data table (SDT)

Either periodic, or triggered by an update

10
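The fold/merge/apply split can be sketched as three small functions. This is an illustrative shape, not the real GraphLab sync interface; the example aggregates a per-vertex value and publishes its average to a hypothetical shared-data-table slot.

```cpp
// Illustrative Fold / Merge / Apply shape of the sync mechanism (not the real API).
// Fold accumulates over vertices sequentially, Merge combines partial folds from
// different threads (tree reduction), Apply writes the result into the SDT.
#include <cstddef>

struct AverageSync {
    // acc := Fold(acc, vertex_value): sequential aggregation
    static double fold(double acc, double vertex_value) { return acc + vertex_value; }
    // combine two partial accumulators (enables a parallel tree reduction)
    static double merge(double left, double right) { return left + right; }
    // publish the aggregate to a (hypothetical) shared data table slot
    static void apply(double& sdt_entry, double acc, std::size_t num_vertices) {
        sdt_entry = acc / static_cast<double>(num_vertices);  // e.g., a global average
    }
};
```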

Page 11: GraphLab: A New Framework For Parallel Machine Learning

Data Consistency
Need to resolve race conditions
◦ e.g. simultaneous updates on the same vertex

Provides 3 levels of consistency
◦ Full consistency
◦ Edge consistency
◦ Vertex consistency

Each model defines which vertices and edges in the scope the update function may safely read and modify

11
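For reference, the three consistency levels can be written as a small enum; the comments restate the guarantee each level gives to an update function. The enum itself is illustrative, not GraphLab's API.

```cpp
// The three consistency levels as a small enum; comments restate the guarantee each
// level gives to an update function (enum is illustrative, not GraphLab's API).
enum class ConsistencyModel {
    Full,    // exclusive access to the entire scope: vertex, adjacent edges, neighbors
    Edge,    // exclusive access to the vertex and adjacent edges; read-only neighbors
    Vertex   // exclusive access to the vertex's own data only
};
```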

Page 12: GraphLab: A New Framework For Parallel Machine Learning

Data Consistency

12

Page 13: GraphLab: A New Framework For Parallel Machine Learning

Scheduling

A collection of basic schedulers
◦ Synchronous
◦ Round-robin

Task schedulers that allow task creation/reordering
◦ FIFO
◦ Prioritized

Lets users create their own scheduler through a Set Scheduler

13

Page 14: GraphLab: A New Framework For Parallel Machine Learning

Set Scheduler

14

The GraphLab framework provides a collection of base schedules. To represent Jacobi style algorithms (e.g., gradient descent) GraphLab provides a synchronous scheduler which ensures that all vertices are updated simultaneously. To represent Gauss-Seidel style algorithms (e.g., Gibbs sampling, coordinate descent), GraphLab provides a round-robin scheduler which updates all vertices sequentially using the most recently available data.

Many ML algorithms (e.g., Lasso, CoEM, Residual BP) require more control over the tasks that are created and the order in which they are executed. Therefore, GraphLab provides a collection of task schedulers which permit update functions to add and reorder tasks. GraphLab provides two classes of task schedulers. The FIFO schedulers only permit task creation but do not permit task reordering. The prioritized schedulers permit task reordering at the cost of increased overhead. For both types of task scheduler GraphLab also provides relaxed versions which increase performance at the expense of reduced control:

              Strict Order      Relaxed Order
FIFO          Single Queue      Multi Queue / Partitioned
Prioritized   Priority Queue    Approx. Priority Queue

In addition, GraphLab provides the Splash scheduler based on the loopy BP schedule proposed by Gonzalez et al. [2009a], which executes tasks along spanning trees.

In the Loopy BP example, different choices of scheduling lead to different BP algorithms. Using the synchronous scheduler corresponds to the classical implementation of BP, and using the priority scheduler corresponds to Residual BP [Elidan et al., 2006].

3.4.1 Set Scheduler

Because scheduling is important to parallel algorithm design, GraphLab provides a scheduler construction framework called the set scheduler which enables users to safely and easily compose custom update schedules. To use the set scheduler, the user specifies a sequence of vertex set and update function pairs ((S1, f1), (S2, f2), ..., (Sk, fk)), where Si ⊆ V and fi is an update function. This sequence implies the following execution semantics:

for i = 1 ... k do
    Execute fi on all vertices in Si in parallel
    Wait for all updates to complete

The amount of parallelism depends on the size of each set; the procedure is highly sequential if the set sizes are small. Executing the schedule in the manner described above can lead to the majority of the processors waiting for a few processors to complete the current set. However, by leveraging the causal data dependencies encoded in the graph structure we are able to construct an execution plan which identifies tasks in future sets that can be executed early while still producing an equivalent result.
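The baseline execution semantics above can be sketched directly in C++: run each update function over its vertex set in parallel, then wait before moving to the next set. This is an assumption-laden sketch (one thread per vertex, no thread pool) meant only to show the barrier-per-set behavior that the execution plan later relaxes.

```cpp
// Baseline set-scheduler semantics: run f_i on all of S_i in parallel, then barrier.
// Illustrative sketch only (one thread per vertex, no thread pool); GraphLab's real
// set scheduler additionally compiles a DAG plan so later tasks can start early.
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Update = std::function<void(std::size_t /*vertex*/)>;

void run_set_schedule(
        const std::vector<std::pair<std::vector<std::size_t>, Update>>& schedule) {
    for (const auto& [vertex_set, update] : schedule) {
        std::vector<std::thread> workers;
        for (std::size_t v : vertex_set)
            workers.emplace_back(update, v);   // execute f_i on every vertex of S_i
        for (auto& t : workers) t.join();      // wait for all updates to complete
    }
}
```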

[Figure 2 diagram: a desired execution sequence (Update1 on v1, v2, v5, then Update2 on v3, v4), the derived execution plan, and the data graph.]

Figure 2: A simple example of the set scheduler planning process. Given the data graph and a desired sequence of execution where v1, v2 and v5 are first run in parallel, followed by v3 and v4: if the edge consistency model is used, we observe that the execution of v3 depends on the state of v1, v2 and v5, but v4 only depends on the state of v5. The dependencies are encoded in the execution plan on the right. The resulting plan allows v4 to be executed immediately after v5 without waiting for v1 and v2.

[Figure 3 diagram: a data dependency graph and shared data table feed a scheduler, which streams update tasks such as Update1(v1), Update2(v5), Update1(v3), Update1(v9) to CPUs 1-3 for execution.]

Figure 3: A summary of the GraphLab framework. The user provides a graph representing the computational data dependencies, as well as an SDT containing read-only data. The user also picks a scheduling method or defines a custom schedule, which provides a stream of update tasks in the form of (vertex, function) pairs to the processors.

The set scheduler compiles an execution plan by rewriting the execution sequence as a Directed Acyclic Graph (DAG), where each vertex in the DAG represents an update task in the execution sequence and edges represent execution dependencies. Fig. 2 provides an example of this process. The DAG imposes a partial ordering over tasks which can be compiled into a parallel execution schedule using the greedy algorithm described by Graham [1966].
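The effect of such a plan can be illustrated with a standard dependency-counting traversal: a task becomes runnable as soon as its dependencies finish, which is what lets tasks from later sets start early. The sketch below is generic Kahn-style scheduling, not GraphLab's implementation (the paper compiles the plan with Graham's greedy list-scheduling algorithm).

```cpp
// Generic dependency-counting (Kahn-style) traversal of a compiled execution plan:
// a task becomes runnable once all of its dependencies have finished, so tasks from
// later sets can run early. Illustrative only.
#include <cstddef>
#include <queue>
#include <vector>

struct DagPlan {
    std::vector<std::vector<std::size_t>> dependents;  // task -> tasks waiting on it
    std::vector<int> unmet_deps;                        // task -> remaining dependencies
};

std::vector<std::size_t> run_order(DagPlan plan) {
    std::queue<std::size_t> ready;
    for (std::size_t t = 0; t < plan.unmet_deps.size(); ++t)
        if (plan.unmet_deps[t] == 0) ready.push(t);     // initially runnable tasks

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t t = ready.front(); ready.pop();
        order.push_back(t);                             // a worker would execute task t here
        for (std::size_t d : plan.dependents[t])
            if (--plan.unmet_deps[d] == 0) ready.push(d);
    }
    return order;
}
```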

3.5 TERMINATION ASSESSMENT

Efficient parallel termination assessment can be challenging. The standard termination conditions used in many iterative ML algorithms require reasoning about the global state. The GraphLab framework provides two methods for termination assessment. The first method relies on the scheduler, which signals termination when there are no remaining tasks. This method works for algorithms like Residual BP, which use task schedulers and stop producing new tasks when they converge. The second termination method relies on user-provided termination functions which examine the SDT and signal when the algorithm has converged. Algorithms, like parameter learning, which rely on global statistics use this method.
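A minimal sketch of how the two termination checks might be combined; the Scheduler::empty() method and the user_check callback are hypothetical names used only for illustration.

```cpp
// Sketch of combining the two termination checks (hypothetical names, illustration only):
// (1) the scheduler reports no remaining tasks, or (2) a user-supplied termination
// function over the shared data table (SDT) signals convergence.
#include <functional>

template <typename Scheduler, typename SDT>
bool should_terminate(const Scheduler& sched, const SDT& shared_data,
                      const std::function<bool(const SDT&)>& user_check) {
    if (sched.empty()) return true;                           // method 1: no tasks left
    if (user_check && user_check(shared_data)) return true;   // method 2: user function
    return false;
}
```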

Page 15: GraphLab: A New Framework For Parallel Machine Learning

Case Study: Loopy Belief Propagation

Iteratively estimates "beliefs" about vertices
◦ Read in messages
◦ Update marginal estimate (belief)
◦ Send updated out messages

Repeat for all variables until convergence

15
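For concreteness, a hedged C++ sketch of one loopy BP step on a pairwise MRF with discrete variables: compute an outgoing message by sum-product over the sending vertex, and measure the L1 residual that (in GraphLab's Algorithm 2, shown later) decides whether the receiving vertex is rescheduled. The data layout and function names are assumptions for illustration, not the paper's code.

```cpp
// Hedged sketch of one loopy BP message update on a pairwise MRF with discrete
// variables (data layout and names are assumptions, not the paper's code).
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // pairwise potential psi[x_v][x_t]

static void normalize(Vec& v) {
    double z = 0.0;
    for (double x : v) z += x;
    if (z > 0) for (double& x : v) x /= z;
}

// Compute the outgoing message m_{v->t} by sum-product over x_v, excluding the
// incoming message from t itself.
Vec bp_message(const Vec& node_potential,            // phi_v(x_v)
               const std::vector<Vec>& in_messages,  // m_{u->v}(x_v) for each neighbor u
               std::size_t skip,                     // index of neighbor t in in_messages
               const Mat& pairwise) {                // psi_{v,t}(x_v, x_t)
    const std::size_t nv = node_potential.size();
    const std::size_t nt = pairwise.empty() ? 0 : pairwise[0].size();
    Vec out(nt, 0.0);
    for (std::size_t xt = 0; xt < nt; ++xt)
        for (std::size_t xv = 0; xv < nv; ++xv) {
            double prod = node_potential[xv] * pairwise[xv][xt];
            for (std::size_t u = 0; u < in_messages.size(); ++u)
                if (u != skip) prod *= in_messages[u][xv];
            out[xt] += prod;
        }
    normalize(out);
    return out;
}

// L1 change between old and new message; if it exceeds the termination bound, the
// receiving vertex would be rescheduled (AddTask in the paper's Algorithm 2).
double l1_residual(const Vec& a, const Vec& b) {
    double r = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) r += std::fabs(a[i] - b[i]);
    return r;
}
```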

Page 16: GraphLab: A New Framework For Parallel Machine Learning

Case Study: Loopy Belief Propagation

Application: 3D retinal image denoising
◦ Represent as a 256x64x64 grid MRF with each vertex as a voxel in the original image

16

3D retinal image denoising

Page 17: GraphLab: A New Framework For Parallel Machine Learning

Case Study: Loopy Belief Propagation

17

3.6 SUMMARY AND IMPLEMENTATION

A GraphLab program is composed of the following parts:

1. A data graph which represents the data and computational dependencies.

2. Update functions which describe local computation.

3. A Sync mechanism for aggregating global state.

4. A data consistency model (i.e., Fully Consistent, Edge Consistent or Vertex Consistent), which determines the extent to which computation can overlap.

5. Scheduling primitives which express the order of computation and may depend dynamically on the data.

We implemented an optimized version of the GraphLab framework in C++ using PThreads. The resulting GraphLab API is available under the LGPL license at http://select.cs.cmu.edu/code. The data consistency models were implemented using race-free and deadlock-free ordered locking protocols. To attain maximum performance we addressed issues related to parallel memory allocation, concurrent random number generation, and cache efficiency. Since mutex collisions can be costly, lock-free data structures and atomic operations were used whenever possible. To achieve the same level of performance for a parallel learning system, the ML community would have to repeatedly overcome many of the same time-consuming systems challenges needed to build GraphLab.

The GraphLab API has the opportunity to be an interface between the ML and systems communities. Parallel ML algorithms built around the GraphLab API automatically benefit from developments in parallel data structures. As new locking protocols and parallel scheduling primitives are incorporated into the GraphLab API, they become immediately available to the ML community. Systems experts can more easily port ML algorithms to new parallel hardware by porting the GraphLab API.

4 CASE STUDIES

To demonstrate the expressiveness of the GraphLab abstraction and illustrate the parallel performance gains it provides, we implement four popular ML algorithms and evaluate these algorithms on large real-world problems using a 16-core computer with 4 AMD Opteron 8384 processors and 64GB of RAM.

4.1 MRF PARAMETER LEARNING

To demonstrate how the various components of the GraphLab framework can be assembled to build a complete ML "pipeline," we use GraphLab to solve a novel three-dimensional retinal image denoising task. In this task we begin with raw three-dimensional laser density estimates, then use GraphLab to generate composite statistics, learn parameters for a large three-dimensional grid pairwise MRF, and then finally compute expectations for each voxel using Loopy BP. Each of these tasks requires both local iterative computation and global aggregation, as well as several different computation schedules.

Algorithm 2: BP update function
BPUpdate(Dv, D*→v, Dv→* ∈ Sv) begin
    Compute the local belief b(xv) using {D*→v, Dv}
    foreach (v → t) ∈ (v → *) do
        Update m_v→t(xt) using {D*→v, Dv} and λ_axis(vt) from the SDT
        residual ← || m_v→t(xt) − m_old_v→t(xt) ||_1
        if residual > Termination Bound then
            AddTask(t, residual)
        end
    end
end

Algorithm 3: Parameter Learning Sync
Fold(acc, vertex) begin
    Return acc + image statistics on vertex
end
Apply(acc) begin
    Apply gradient step to λ using acc and return λ
end


We begin by using the GraphLab data graph to build a large (256x64x64) three-dimensional MRF in which each vertex corresponds to a voxel in the original image. We connect neighboring voxels in the 6 axis-aligned directions. We store the density observations and beliefs in the vertex data and the BP messages in the directed edge data. As shared data we store three global edge parameters which determine the smoothing (accomplished using Laplace similarity potentials) in each dimension. Prior to learning the model parameters, we first use the GraphLab sync mechanism to compute axis-aligned averages as a proxy for "ground-truth" smoothed images along each dimension. We then performed simultaneous learning and inference in GraphLab by using the background sync mechanism (Alg. 3) to aggregate inferred model statistics and apply a gradient descent procedure. To the best of our knowledge, this is the first time graphical model parameter learning and BP inference have been done concurrently.

Results: In Fig. 4(a) we plot the speedup of the parameter learning algorithm, executing inference and learning sequentially. We found that the Splash scheduler outperforms other scheduling techniques, enabling a factor 15 speedup on 16 cores. We then evaluated simultaneous parameter learning and inference by allowing the sync mechanism to run concurrently with inference (Fig. 4(b) and Fig. 4(c)). By running a background sync at the right frequency, we found that we can further accelerate parameter learning while only marginally affecting the learned parameters. In Fig. 4(d) and Fig. 4(e) we plot examples of noisy and denoised cross sections respectively.

Page 18: GraphLab: A New Framework For Parallel Machine Learning

Case Studies: Loopy Belief Propagation

18

[Figure 4 plot panels: (a) speedup vs. number of processors for the Priority, Approx. Priority, and Splash schedules against a linear reference; (b) total learning runtime (seconds) vs. sync frequency (seconds); (c) average % deviation vs. sync frequency; (d) original slice; (e) denoised slice.]

Figure 4: Retinal Scan Denoising. (a) Speedup relative to the best single-processor runtime of parameter learning using priority, approx. priority, and Splash schedules. (b) The total runtime in seconds of parameter learning and (c) the average percent deviation in learned parameters plotted against the time between gradient steps using the Splash schedule on 16 processors. (d, e) A slice of the original noisy image and the corresponding expected pixel values after parameter learning and denoising.

4.2 GIBBS SAMPLING

The Gibbs sampling algorithm is inherently sequential and has frustrated efforts to build asymptotically consistent parallel samplers. However, a standard result in parallel algorithms [Bertsekas and Tsitsiklis, 1989] is that for any fixed length Gauss-Seidel schedule there exists an equivalent parallel execution which can be derived from a coloring of the dependency graph. We can extract this form of parallelism using the GraphLab framework. We first use GraphLab to construct a greedy graph coloring on the MRF and then to execute an exact parallel Gibbs sampler.

We implement the standard greedy graph coloring algorithm in GraphLab by writing an update function which examines the colors of the neighboring vertices of v, and sets v to the first unused color. We use the edge consistency model with the parallel coloring algorithm to ensure that the parallel execution retains the same guarantees as the sequential version. The parallel Gauss-Seidel schedule is then built using the GraphLab set scheduler (Sec. 3.4.1) and the coloring of the MRF. The resulting schedule consists of a sequence of vertex sets S1 to SC such that Si contains all the vertices with color i. The vertex consistency model is sufficient since the coloring ensures full sequential consistency.
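The greedy coloring step described above reduces to a per-vertex "first unused color" computation, sketched below; with n neighbors, some color in 0..n is always free, which is why the fixed-size array suffices. This is an illustrative helper, not GraphLab's implementation.

```cpp
// The per-vertex greedy coloring step: take the smallest color not used by any
// neighbor. Illustrative helper, not GraphLab's implementation.
#include <cstddef>
#include <vector>

int first_unused_color(const std::vector<int>& neighbor_colors) {
    std::vector<bool> used(neighbor_colors.size() + 1, false);
    for (int c : neighbor_colors)
        if (c >= 0 && static_cast<std::size_t>(c) < used.size()) used[c] = true;
    for (std::size_t c = 0; c < used.size(); ++c)
        if (!used[c]) return static_cast<int>(c);
    return static_cast<int>(used.size());  // unreachable: some color above is always free
}
```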

To evaluate the GraphLab parallel Gibbs sampler we consider the challenging task of marginal estimation on a factor graph representing a protein-protein interaction network obtained from Elidan et al. [2006] by generating 10,000 samples. The resulting MRF has roughly 100K edges and 14K vertices. As a baseline for comparison we also ran a GraphLab version of the highly optimized Splash Loopy BP [Gonzalez et al., 2009b] algorithm.

Results: In Fig. 5 we present the speedup and efficiency results for Gibbs sampling and Loopy BP. Using the set schedule in conjunction with the planning optimization enables the Gibbs sampler to achieve a factor of 10 speedup on 16 processors. The execution plan takes 0.05 seconds to compute, an immaterial fraction of the 16-processor running time. Because of the structure of the MRF, a large number of colors (20) is needed and the vertex distribution over colors is heavily skewed. Consequently there is a strong sequential component to running the Gibbs sampler on this model. In contrast, the Loopy BP speedup demonstrates considerably better scaling, with a factor of 15 speedup on 16 processors. The larger cost per BP update in conjunction with the ability to run a fully asynchronous schedule enables Loopy BP to achieve relatively uniform update efficiency compared to Gibbs sampling.

4.3 CO-EM

To illustrate how GraphLab scales in settings with large structured models we designed and implemented a parallel version of Co-EM [Jones, Nigam and Ghani, 2000], a semi-supervised learning algorithm for named entity recognition (NER). Given a list of noun phrases (NP) (e.g., "big apple"), contexts (CT) (e.g., "citizen of "), and co-occurrence counts for each NP-CT pair in a training corpus, CoEM tries to estimate the probability (belief) that each entity (NP or CT) belongs to a particular class (e.g., "country" or "person"). The CoEM update function is relatively fast, requiring only a few floating point operations, and therefore stresses the GraphLab implementation by requiring GraphLab to manage massive amounts of fine-grained parallelism.

The GraphLab graph for the CoEM algorithm is a bipartite graph with each NP and CT represented as a vertex, connected by edges with weights corresponding to the co-occurrence counts. Each vertex stores the current estimate of the belief for the corresponding entity. The update function for CoEM recomputes the local belief by taking a weighted average of the adjacent vertex beliefs. The adjacent vertices are rescheduled if the belief changes by more than some predefined threshold (10^-5).
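A hedged sketch of the CoEM update just described: the new belief is the co-occurrence-weighted average of the neighbors' beliefs, and the neighbors are rescheduled when the belief moves by more than the threshold. The data structures are illustrative assumptions, not the paper's implementation.

```cpp
// Hedged sketch of the CoEM update: the new belief is the co-occurrence-weighted
// average of the neighbors' beliefs; neighbors are rescheduled when the belief moves
// by more than the threshold (10^-5 in the paper). Data layout is an assumption.
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct CoEmNeighbor { double weight; double belief; };  // co-occurrence count, neighbor belief

// Returns {new belief, whether the adjacent vertices should be rescheduled}.
std::pair<double, bool> coem_update(double old_belief,
                                    const std::vector<CoEmNeighbor>& nbrs,
                                    double threshold = 1e-5) {
    double num = 0.0, den = 0.0;
    for (const CoEmNeighbor& n : nbrs) { num += n.weight * n.belief; den += n.weight; }
    const double new_belief = den > 0.0 ? num / den : old_belief;
    return {new_belief, std::fabs(new_belief - old_belief) > threshold};
}
```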

We experimented with the following two NER datasets obtained from web-crawling data.

Name    Classes   Verts.     Edges      1 CPU Runtime
small   1         0.2 mil.   20 mil.    40 min
large   135       2 mil.     200 mil.   8 hours

We plot in Fig. 6(a) and Fig. 6(b) the speedup obtained by

Page 19: GraphLab: A New Framework For Parallel Machine Learning

Case Studies

Other examples (Gibbs sampling, Co-EM, ...)
◦ Check the paper!

19

Page 20: GraphLab: A New Framework For Parallel Machine Learning

Limitations

Reports self-relative scaling numbers, but comparisons with other frameworks or with a serial baseline are missing.

Implicitly assumes a shared-memory model; how does it work for distributed systems?

Do ML algorithms really operate on large graphs?

20

Page 21: GraphLab: A New Framework For Parallel Machine Learning

Conclusion
Prior parallel frameworks unsuitable for ML
◦ High-level: not expressive enough
◦ Low-level: difficult to program

GraphLab provides a vertex-centric framework on data graphs

Scalability up to 16 cores on a wide range of ML applications
