
Asynchronous Parallel Stochastic Gradient Descent

A Numeric Core for Scalable Distributed Machine Learning Algorithms

Janis Keuper and Franz-Josef Pfreundt
Fraunhofer ITWM

Competence Center High Performance Computing
Kaiserslautern, Germany

{janis.keuper | franz-josef.pfreundt}@itwm.fhg.de

ABSTRACT
The implementation of a vast majority of machine learning (ML) algorithms boils down to solving a numerical optimization problem. In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in terms of convergence and accuracy. Recently, several parallelization approaches have been proposed in order to scale SGD to solve very large ML problems. At their core, most of these approaches are following a MapReduce scheme.
This paper presents a novel parallel updating algorithm for SGD, which utilizes the asynchronous single-sided communication paradigm. Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent (ASGD) provides faster convergence, at linear scalability and stable accuracy.

1. INTRODUCTION
The enduring success of Big Data applications, which typically include the mining, analysis and inference of very large datasets, is leading to a change in paradigm for machine learning research objectives [4]. With plenty of data at hand, the traditional challenge of inferring generalizing models from small sets of available training samples moves out of focus. Instead, the availability of resources like CPU time, memory size or network bandwidth has become the dominating limiting factor for large scale machine learning algorithms.
In this context, algorithms which guarantee useful results even in the case of an early termination are of special interest. With limited (CPU) time, fast and stable convergence is of high practical value, especially when the computation can be stopped at any time and continued some time later when more resources are available.
Parallelization of machine learning (ML) methods has been a rising topic for some time (refer to [1] for a comprehensive overview). However, until the introduction of the MapReduce pattern, research was mainly focused on shared memory systems. This changed with the presentation of a generic MapReduce strategy for ML algorithms in [5], which showed that most of the existing ML techniques could easily be transformed to fit the MapReduce scheme.
After a short period of rather enthusiastic porting of algorithms to this framework, concerns started to grow whether following the MapReduce ansatz truly provides a solid solution for large scale ML. It turns out that MapReduce's easy parallelization comes at the cost of poor scalability [16]. The main reason for this undesired behavior resides deep down within the numerical properties most machine learning algorithms have in common: an optimization problem. In this context, MapReduce works very well for the implementation of so-called batch-solver approaches, which were also used in the MapReduce framework of [5]. However, batch-solvers have to run over the entire dataset to compute a single iteration step. Hence, their scalability with respect to the data size is obviously poor.

Figure 1: Evaluation of the scaling properties of different parallel gradient descent algorithms for machine learning applications on distributed memory systems. Results show a K-Means clustering with k=10 on a 10-dimensional target space, represented by ∼1TB of training samples. Our novel ASGD method is not only the fastest algorithm in this test, it also shows better than linear scaling performance, outperforming the SGD parallelization by [20] and the MapReduce based BATCH [5] optimization, which both suffer from communication overheads.



Even long before parallelization had become a topic, most ML implementations avoided the known drawbacks of batch-solvers by using alternative online optimization methods. Most notably, Stochastic Gradient Descent (SGD) methods have long proven to provide good results for ML optimization problems [16]. However, due to its inherent sequential nature, SGD is hard to parallelize and even harder to scale [16], especially when communication latencies are causing dependency locks, which is typical for parallelization tasks on distributed memory systems [20].
The aim of this paper is to propose a novel, lock-free parallelization method for the computation of stochastic gradient optimization for large scale machine learning algorithms on cluster environments.

1.1 Related Work
Recently, several approaches [6][20][17][16][13] towards an effective parallelization of the SGD optimization have been proposed. A detailed overview and in-depth analysis of their application to machine learning can be found in [20]. In this section, we focus on a brief discussion of the related publications which provided the essentials for our approach:

• A theoretical framework for the analysis of SGD parallelization performance has been presented in [20]. The same paper also introduced a novel approach (called SimuParallelSGD), which avoids communication and any locking mechanisms up to a single and final MapReduce step. To the best of our knowledge, SimuParallelSGD is currently the best performing algorithm concerning cluster based parallelized SGD. A detailed discussion of this method is given in section 2.3.

• In [17], a so-called mini-BATCH update scheme has been introduced. It was shown that replacing the strict online updating mechanism of SGD with small accumulations of gradient steps can significantly improve the convergence speed and robustness (also see section 2.4).

• A widely noticed approach for a "lock-free" parallelization of SGD on shared memory systems has been introduced in [16]. The basic idea of this method is to explicitly ignore potential data races and to write updates directly into the memory of other processes. Given a minimum level of sparsity, they were able to show that possible data races will neither harm the convergence nor the accuracy of a parallel SGD. Even more, without any locking overhead, [16] sets the current performance standard for shared memory systems.

• A distributed version of [16] has been presented in [14], showing cross-host CPU to CPU and GPU to GPU gradient updates over Ethernet connections.

• In [8], the concept of a Partitioned Global Address Space programming framework (called GASPI) has been introduced. It provides an asynchronous, single-sided communication and parallelization scheme for cluster environments (further details in section 3.1). We build our asynchronous communication on the basis of this framework.

1.2 Asynchronous SGD
The basic idea of our proposed method is to port the "lock-free" shared memory approach from [16] to distributed memory systems. This is far from trivial, mostly because communication latencies in such systems will inevitably cause expensive dependency locks if the communication is performed via common two-sided protocols (such as MPI message passing or MapReduce). This is also the motivation for SimuParallelSGD [20] to avoid communication during the optimization: locking costs are usually much higher than the information gain induced by the communication.
We overcome this dilemma by applying the asynchronous, single-sided communication model provided by [8]: individual processes send mini-BATCH [17] updates completely uninformed of the recipient's status whenever they are ready to do so. On the recipient side, available updates are included in the local computation as they become available. In this scheme, no process ever waits for any communication to be sent or received. Hence, communication is literally "free" (in terms of latency).
Of course, such a communication scheme will cause data races and race conditions: updates might be (partially) overwritten before they are used, or might even be counterproductive because the sender's state is far behind the state of the recipient.
We resolve these problems by two strategies: first, we obey the sparsity requirements introduced by [16]. This can be achieved by sending only partial updates to a few random recipients. Second, we introduce a Parzen-window function, selecting only those updates for the local descent which are likely to improve the local state. Figure 2 gives a schematic overview of the ASGD algorithm's asynchronous communication scheme.
The remainder of this paper is organized as follows: first, we briefly review gradient descent methods in section 2 and discuss further aspects of the previously mentioned related approaches in more detail. Section 3 gives a quick overview of the asynchronous communication concept and its implementation. The actual details of the ASGD algorithm are introduced in section 4, followed by a theoretical analysis and an extensive experimental evaluation in section 5.

2. GRADIENT DESCENT OPTIMIZATION
From a strongly simplified perspective, machine learning tasks are usually about the inference of generalized models from a given dataset X = {x_0, . . . , x_m} with x_i ∈ R^n, which in the case of supervised learning is also assigned semantic labels Y = {y_0, . . . , y_m}, y_i ∈ R.
During the learning process, the quality of a model is evaluated by use of so-called loss functions, which measure how well the current model represents the given data. We write x_j(w) or (x_j, y_j)(w) to indicate the loss of a data point for the current parameter set w of the model function. We will also refer to w as the state of the model. The actual learning is then the process of minimizing the loss over all samples. This is usually implemented via a gradient descent over the partial derivative of the loss function in the parameter space of w.

2.1 Batch Optimization
The numerically easiest way to solve most gradient descent optimization problems is the so-called batch optimization.


Figure 2: Overview of the asynchronous update communication used in ASGD. Given a cluster environment of R nodes with H threads each, the blue markers indicate different stages and scenarios of the communication mode. I: Thread 3 of node 1 finished the computation of its local mini-batch update. The external buffer is empty; hence it executes the update locally and sends the resulting state to a few random recipients. II: Thread 1 of node 2 receives an update. When its local mini-batch update is ready, it will use the external buffer to correct its local update and then follow I. III: Shows a potential data race: two external updates might overlap in the external buffer of thread H − 1 of node 2. Resolving data races is discussed in section 4.4.

A state w_t at time t is updated by the mean gradient generated by ALL samples of the available dataset. Algorithm 1 gives an overview of the BATCH optimization scheme. A MapReduce parallelization for many BATCH optimized machine learning algorithms has been introduced by [5].

Algorithm 1 BATCH optimization with samples X = {x_0, . . . , x_m}, iterations T, step size ε and states w

1: for all t = 0 . . . T do
2:   init w_{t+1} = 0
3:   update w_{t+1} = w_t − ε ∑_{x_j ∈ X} ∂_w x_j(w_t)
4:   w_{t+1} = w_{t+1} / |X|
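For illustration, the following Python sketch mirrors the structure of algorithm 1 for a generic differentiable loss, averaging the gradient over the full dataset in every iteration as described above. The helper `loss_grad` and the quadratic toy loss are hypothetical stand-ins; the paper's actual implementation is written in C++ on top of GPI 2.0.

```python
import numpy as np

def batch_gradient_descent(X, loss_grad, w0, T=100, eps=0.01):
    """Full-batch gradient descent (cf. algorithm 1): every iteration
    uses the mean gradient over ALL samples of the dataset."""
    w = w0.copy()
    for _ in range(T):
        grad = np.zeros_like(w)
        for x in X:                      # touch the entire dataset per step
            grad += loss_grad(x, w)
        w = w - eps * grad / len(X)      # mean gradient step
    return w

# Toy example: loss 0.5*||x - w||^2 per sample, gradient w - x.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, scale=0.5, size=(1000, 10))
    w_hat = batch_gradient_descent(X, lambda x, w: w - x,
                                   w0=np.zeros(10), T=50, eps=0.5)
    print(w_hat.round(2))                # approaches the sample mean (about 3.0)
```

The cost of touching every sample for a single step is exactly the scalability problem with respect to the data size discussed above.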

2.2 Stochastic Gradient Descent
In order to overcome the drawbacks of full batch optimization, many online updating methods have been proposed. One of the most prominent is SGD. Although some properties of Stochastic Gradient Descent approaches might prevent their successful application to some optimization domains, they are well established in the machine learning community [2]. Following the notation in [20], SGD can be formalized in pseudo code as outlined in algorithm 2.

Algorithm 2 SGD with samples X = {x_0, . . . , x_m}, iterations T, step size ε and states w

Require: ε > 0
1: for all t = 0 . . . T do
2:   draw j ∈ {1 . . . m} uniformly at random
3:   update w_{t+1} ← w_t − ε ∂_w x_j(w_t)
4: return w_T

The advantage in terms of computational cost with respect to the number of data samples is eminent: compared to batch updates of quadratic complexity, SGD updates come at linearly growing iteration costs. At least for ML applications, SGD error rates even outperform batch algorithms in many cases [2].
Since the actual update step in line 3 of algorithm 2 plays a crucial role in deriving our approach, we simplify the notation in this step and denote the partial derivative of the loss function for the remainder of this paper in terms of an update step ∆:

∆_j(w_t) := ∂_w x_j(w_t).    (1)
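A minimal sketch of algorithm 2 in the same notation, where `delta` stands for the per-sample update step ∆_j of equation (1); the function name and call signature are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def sgd(X, delta, w0, T=10_000, eps=0.01, seed=42):
    """Plain SGD (cf. algorithm 2): one uniformly drawn sample per update."""
    w = w0.copy()
    rng = np.random.default_rng(seed)
    for _ in range(T):
        j = rng.integers(len(X))        # draw j uniformly at random
        w = w - eps * delta(X[j], w)    # w_{t+1} <- w_t - eps * Delta_j(w_t)
    return w

# Example with the same toy loss as before: delta(x, w) = w - x.
# w_hat = sgd(X, lambda x, w: w - x, np.zeros(10))
```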

2.3 Parallel SGD
The current "state of the art" approach towards a parallel SGD algorithm for distributed memory systems has been presented in [20]. The main objective of their ansatz is to avoid communication between working threads, thus preventing dependency locks. After a coordinated initialization step, all workers operate independently until convergence (or early termination). The theoretical analysis in [20] unveiled the surprising fact that a single aggregation of the distributed results after termination is sufficient in order to guarantee good convergence and error rates. Given a learning rate (i.e. step size) ε and the number of threads n, this formalizes as shown in algorithm 3.

Algorithm 3 SimuParallelSGD with samples X = {x_0, . . . , x_m}, iterations T, step size ε, number of threads n and states w

Require: ε > 0, n > 1
1: define H = ⌊m/n⌋
2: randomly partition X, giving H samples to each node
3: for all i ∈ {1, . . . , n} parallel do
4:   randomly shuffle samples on node i
5:   init w^i_0 = 0
6:   for all t = 0 . . . T do
7:     get the t-th sample on the i-th node and compute
8:     update w^i_{t+1} ← w^i_t − ε ∆_t(w^i_t)
9: aggregate v = (1/n) ∑_{i=1}^n w^i_t
10: return v
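A sequential Python sketch of SimuParallelSGD (algorithm 3): each worker runs SGD on its own shuffled partition without any communication, and the local models are averaged exactly once at the end. The plain loop over workers stands in for truly parallel threads or nodes; `delta` is again a hypothetical per-sample gradient.

```python
import numpy as np

def simu_parallel_sgd(X, delta, dim, n_workers=4, T=1000, eps=0.05, seed=0):
    """SimuParallelSGD (cf. algorithm 3): independent SGD runs, one final average."""
    rng = np.random.default_rng(seed)
    H = len(X) // n_workers                        # samples per worker
    perm = rng.permutation(len(X))                 # random partitioning of X
    parts = [X[perm[i * H:(i + 1) * H]] for i in range(n_workers)]

    local_models = []
    for part in parts:                             # conceptually parallel, no communication
        w = np.zeros(dim)
        order = rng.permutation(len(part))         # randomly shuffle samples locally
        for t in range(min(T, len(part))):
            w = w - eps * delta(part[order[t]], w)
        local_models.append(w)

    return np.mean(local_models, axis=0)           # single aggregation after termination
```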

2.4 Mini-Batch SGD
The mini-batch modification introduced by [17] tries to unite the advantages of online SGD with the stability of BATCH methods. It follows the SGD scheme, but instead of updating after each single data sample, it aggregates several samples into a small batch. This mini-batch is then used to perform the online update. It can be implemented as shown in algorithm 4.

Algorithm 4 Mini-Batch SGD with samples X = {x_0, . . . , x_m}, iterations T, step size ε, number of threads n and mini-batch size b

Require: ε > 0
1: for all t = 0 . . . T do
2:   draw mini-batch M ← b samples from X
3:   init ∆w_t = 0
4:   for all x ∈ M do
5:     aggregate update ∆w_t ← ∆w_t + ∂_w x(w_t)
6:   update w_{t+1} ← w_t − ε ∆w_t
7: return w_T
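A short sketch of the mini-batch accumulation in algorithm 4: b per-sample gradients are aggregated before a single step is taken, so the step size has to be chosen with b in mind. `delta` is hypothetical as before.

```python
import numpy as np

def minibatch_sgd(X, delta, w0, T=1000, eps=0.001, b=32, seed=0):
    """Mini-batch SGD (cf. algorithm 4): accumulate b gradients, then update once."""
    w = w0.copy()
    rng = np.random.default_rng(seed)
    for _ in range(T):
        batch = X[rng.integers(len(X), size=b)]    # draw mini-batch M of size b
        dw = np.zeros_like(w)
        for x in batch:
            dw += delta(x, w)                      # aggregate the update over the batch
        w = w - eps * dw                           # single step; eps must account for b
    return w
```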

3. ASYNCHRONOUS COMMUNICATION
Figure 3 shows the basic principle of the asynchronous communication model compared to the more commonly applied two-sided synchronous message passing scheme. An overview of the properties, theoretical implications and pitfalls of parallelization by asynchronous communication can be found in [18]. For the scope of this paper, we rely on the fact that single-sided communication can be used to design lock-free parallel algorithms. This can be achieved by design patterns propagating an early communication of data into the work-queues of remote processes, keeping these busy at all times. If successful, communication virtually becomes "free" in terms of latency.
However, adapting sequential algorithms to fit such a pattern is far from trivial. This is mostly because of the inevitable loss of information on the current state of the sender and receiver.

Figure 3: Single-sided asynchronous communication model (right) compared to a typical synchronous model (left). The red areas indicate dependency locks of the processes p1, p2, waiting for data or acknowledgements. The asynchronous model is lock-free, but comes at the price that processes never know if, when and in what order messages reach the receiver. Hence, a process can only be informed about past states of a remote computation, never about the current status.

3.1 GASPI Specification
The Global Address Space Programming Interface (GASPI) [8] uses one-sided RDMA driven communication with remote completion to provide a scalable, flexible and failure tolerant parallelization framework. GASPI favors an asynchronous communication model over two-sided bulk communication schemes. The open-source implementation GPI 2.0 (available at http://www.gpi-site.com/gpi2/) provides a C++ interface to the GASPI specification. Benchmarks for various applications (see http://www.gpi-site.com/gpi2/benchmarks/) show that the GASPI communication schemes can outperform MPI based implementations [7][11] for many applications.
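The GASPI implementation used here, GPI 2.0, exposes a C++ interface; purely as a language-agnostic illustration of the single-sided pattern described above, the following Python sketch emulates one-slot "remote buffers" that senders overwrite without any handshake and that the owner drains whenever convenient. This is a conceptual stand-in with hypothetical names; it does not use the GASPI/GPI-2 API.

```python
import numpy as np

class OneSidedBuffer:
    """One write-slot per worker: senders overwrite it at will (fire and forget),
    the owner reads and clears it whenever it gets around to it. No sender ever
    waits for a receiver; an unread message may simply be overwritten."""

    def __init__(self, n_workers, dim):
        self.slots = [np.zeros(dim) for _ in range(n_workers)]

    def remote_write(self, target, state):
        self.slots[target] = state.copy()          # may clobber an unread message

    def drain(self, owner):
        msg = self.slots[owner]
        self.slots[owner] = np.zeros_like(msg)     # zero norm marks an empty slot (cf. equation (3) below)
        return msg if np.linalg.norm(msg) > 0 else None
```

In the real implementation these slots live in RDMA-accessible GASPI segments, so the write itself becomes a true one-sided transfer.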

4. THE ASGD ALGORITHM
The concept of the ASGD algorithm, as described in section 1.2 and figure 2, is formalized and implemented on the basis of the SGD parallelization presented in [20]. In fact, the asynchronous communication is just added to the existing approach. This is based on the assumption that communication (if performed correctly) can only improve the gradient descent, especially when it is "free". If the communication interval is set to infinity, ASGD will become SimuParallelSGD.

Parameters
ASGD takes several parameters which can have a strong influence on the convergence speed and quality (see the experiments for details on the impact): T defines the size of the data partition for each thread, ε sets the gradient step size (which needs to be fixed following the theoretical constraints shown in [20]), b sets the size of the mini-batch aggregation, and I gives the number of SGD iterations for each thread. Practically, this also equals the number of data points touched by each thread.

Initialization
The initialization step is straightforward and analogous to SimuParallelSGD [20]: the data is split into working packages of size T and distributed to the worker threads. A control thread generates initial, problem dependent values for w_0 and communicates w_0 to all workers. From that point on, all workers run independently, following the asynchronous communication scheme shown in figure 2.
It should be noted that w_0 could also be initialized with the preliminary results of a previously early terminated optimization run.

Updating
The online gradient descent update step is the key leverage point of the ASGD algorithm. The local state w^i_t of thread i at iteration t is updated by an externally modified step ∆_t(w^i_{t+1}), which not only depends on the local ∆_t(w^i_{t+1}) but also on a possibly communicated state w^j_{t'} from an unknown iteration t' at some random thread j:

∆_t(w^i_{t+1}) = w^i_t − (1/2)(w^i_t + w^j_{t'}) + ∆_t(w^i_{t+1})    (2)

For the usage of N external buffers per thread, we generalize equation (2) to

∆_t(w^i_{t+1}) = w^i_t − (1/(|N| + 1)) (∑_{n=1}^{N} w^n_{t'} + w^i_t) + ∆_t(w^i_{t+1}),

where |N| := ∑_{n=1}^{N} λ(w^n_{t'}),  with λ(w^n_{t'}) = 1 if ‖w^n_{t'}‖_2 > 0 and 0 otherwise,    (3)

with N incoming messages. Figure 4 gives a schematic overview of the update process.

4.1 Parzen-Window Optimization
As discussed in section 1.2 and shown in figure 2, the asynchronous communication scheme is prone to cause data races and other conditions during the update. Hence, we introduce a Parzen-window like function δ(i, j) to avoid "bad" update conditions. The data races are discussed in section 4.4.

δ(i, j) := { 1 if ‖(w^i_t − ε ∆w^i_t) − w^j_{t'}‖_2 < ‖w^i_t − w^j_{t'}‖_2; 0 otherwise }    (4)

We consider an update to be "bad" if the external state w^j_{t'} would direct the update away from the projected solution, rather than towards it. Figure 4 shows the evaluation of δ(i, j), which is then plugged into the update functions of ASGD in order to exclude undesirable external states from the computation. Hence, equation (2) turns into

∆_t(w^i_{t+1}) = [w^i_t − (1/2)(w^i_t + w^j_{t'})] δ(i, j) + ∆_t(w^i_{t+1})    (5)

and equation (3) to

∆_t(w^i_{t+1}) = w^i_t − (1/(∑_{n=1}^{N} δ(i, n) + 1)) (∑_{n=1}^{N} δ(i, n) w^n_{t'} + w^i_t) + ∆_t(w^i_{t+1}).    (6)
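As a sketch of equations (4) and (6), the following hypothetical helpers keep only those external states that point towards the locally projected solution and average the survivors into the step; `delta_local` stands for the local (mini-batch) gradient ∆.

```python
import numpy as np

def parzen_accepts(w_local, delta_local, w_ext, eps):
    """Equation (4): accept an external state only if the projected local step
    ends up closer to it than the current state already is."""
    projected = w_local - eps * delta_local
    return np.linalg.norm(projected - w_ext) < np.linalg.norm(w_local - w_ext)

def asgd_step(w_local, delta_local, external_states, eps):
    """Equation (6): blend the local state with the accepted external states,
    add the local gradient, and take the step."""
    accepted = [w for w in external_states
                if parzen_accepts(w_local, delta_local, w, eps)]
    blended = (w_local + sum(accepted)) / (len(accepted) + 1)
    step = w_local - blended + delta_local        # modified update step, eq. (6)
    return w_local - eps * step                   # w_{t+1} = w_t - eps * step
```

If no external state passes the window, `blended` equals the local state and the step reduces to the plain local gradient, which is consistent with the observation that ASGD degenerates to SimuParallelSGD when no communication takes place.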

Computational Costs
Obviously, the evaluation of δ(i, j) comes at some computational cost. Since δ(i, j) has to be evaluated for each received message, the "free" communication is actually not so free after all. However, the costs are very low and can be reduced to the computation of the distance between two states, which can be achieved linearly in the dimensionality of the parameter space of w and the mini-batch size: O((1/b)|w|). Experiments in section 5 show that the impact of the communication costs is negligible.
In practice, the communication frequency 1/b is mostly constrained by the network bandwidth between the compute nodes, which is briefly discussed in section 4.5.

4.2 Mini-Batch Extension
We further alter the update of our ASGD by extending it with the mini-batch approach introduced in section 2.4. The motivation for this is twofold: first, we would like to benefit from the advantages of mini-batch updates shown in [17]. Also, the sparse nature of the asynchronous communication forces us to accumulate updates anyway. Otherwise, the external states could only affect single SGD iteration steps. Because the communication frequency is practically bound by the node interconnection bandwidth, the size of the mini-batch b is used to control the impact of external states. We write ∆_M in order to differentiate mini-batch steps from single sample steps ∆_t of sample x_t:

∆_M(w^i_{t+1}) = [w^i_t − (1/2)(w^i_t + w^j_t)] δ(i, j) + ∆_M(w^i_{t+1})    (7)

Note that the step size ε is not independent of b and should be adjusted accordingly.

4.3 The final ASGD Update Algorithm
Reassembling our extensions into SGD, we obtain the final ASGD algorithm. With mini-batch size b, number of iterations T and learning rate ε, the update can be implemented as shown in algorithm 5.

Algorithm 5 ASGD (X = {x_0, . . . , x_m}, T, ε, w_0, b)

Require: ε > 0, n > 1
1: define H = ⌊m/n⌋
2: randomly partition X, giving H samples to each node
3: for all i ∈ {1, . . . , n} parallel do
4:   randomly shuffle samples on node i
5:   init w^i_0 = 0
6:   for all t = 0 . . . T do
7:     draw mini-batch M ← b samples from X
8:     update w^i_{t+1} ← w^i_t − ε ∆_M(w^i_{t+1})
9:     send w^i_{t+1} to a random node ≠ i
10: return w^1_I

At termination, all nodes w^i, i ∈ {1, . . . , n} hold small local variations of the global result. As shown in algorithm 5, one can simply return one of these local models (namely w^1_I) as the global result. Alternatively, we could also aggregate the w^i_I via MapReduce. Experiments in section 5.5 show that in most cases the first variant is sufficient and faster.
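Putting the pieces together, the following self-contained Python sketch shows the shape of one ASGD worker in the spirit of algorithm 5. The list `mailboxes` is a stand-in for the single-sided GASPI buffers (one slot per worker), and `delta` is a hypothetical per-sample gradient; in the actual implementation the buffer lives in RDMA segments and the send in the last line is a one-sided write.

```python
import numpy as np

def asgd_worker(i, part, delta, mailboxes, n_workers, dim,
                T=1000, eps=0.05, b=500, seed=0):
    """One ASGD worker (cf. algorithm 5): mini-batch step, blend in whatever
    arrived in the local buffer, then push the new state to a random peer."""
    rng = np.random.default_rng(seed + i)
    w = np.zeros(dim)
    for _ in range(T):
        batch = part[rng.integers(len(part), size=b)]
        dw = np.mean([delta(x, w) for x in batch], axis=0)     # local mini-batch step

        ext = mailboxes[i]                                     # drain own buffer,
        mailboxes[i] = None                                    # non-blocking "receive"
        accepted = []
        if ext is not None:
            proj = w - eps * dw                                # Parzen window, eq. (4)
            if np.linalg.norm(proj - ext) < np.linalg.norm(w - ext):
                accepted.append(ext)

        blended = (w + sum(accepted)) / (len(accepted) + 1)    # eq. (6)
        w = w - eps * (w - blended + dw)                       # ASGD update step

        peer = (i + rng.integers(1, n_workers)) % n_workers    # random node != i
        mailboxes[peer] = w.copy()                             # fire-and-forget send
    return w
```

Run over all workers, one of the local models (or their average) is returned as the global result, mirroring the two aggregation variants compared in section 5.5.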

4.4 Data races and sparsity
Potential data races during the asynchronous external update come in two forms. First, the complete neglect of an update state w^j because it has been completely overwritten by a second state w^h. Since ASGD communication is de-facto optional, a lost message might slow down the convergence by a margin, but is completely harmless otherwise. The second case is more complicated: a partially overwritten message, i.e. w^i reads an update from w^j while this is overwritten by the update from w^h.
We address this data race issue based on the findings in [16]. There, it has been shown that the error which is induced by such data races during an SGD update is linearly bound in the number of conflicting variables and tends to underestimate the gradient projection. [16] also showed that for sparse problems, where the probability of conflicts is reduced, data race errors are negligible. For non sparse problems, [16] showed that sparsity can be induced by partial updating. We apply this approach to ASGD updates, leaving the choice of the partitioning to the application, e.g. for K-Means we partition along the individual cluster centers of the states. Additionally, the asynchronous communication model causes further sparsity in time, as processes read external updates with shifted delays. This further decreases the probability of races.

Figure 4: ASGD updating. This figure visualizes the update algorithm of a process with state w^i_t, its local mini-batch update ∆_t(w^i_{t+1}) and received external state w^j_t for a simplified 1-dimensional optimization problem. The dotted lines indicate a projection of the expected descent path to a (local) optimum. I: Initial setting: ∆_M(w^i_{t+1}) is computed and w^j_t is in the external buffer. II: Parzen-window masking of w^j_t: only if the condition of equation (4) is met will w^j_t contribute to the local update. III: Computing ∆_M(w^i_{t+1}). IV: Updating w^i_{t+1} ← w^i_t − ε ∆_M(w^i_{t+1}).

4.5 Communication load balancing
We previously discussed that the choice of the communication frequency 1/b has a significant impact on the convergence speed. Theoretically, more communication should be beneficial. However, due to the limited bandwidth, the practical limit is expected to be far from b = 1.
The choice of an optimal b strongly depends on the data (in terms of dimensionality) and the computing environment: interconnection bandwidth, number of nodes, CPUs per node, NUMA layout and so on. Hence, b is a parameter which needs to be determined experimentally. For most of the experiments shown in section 5, we found 500 ≤ b ≤ 2000 to be quite stable.

5. EXPERIMENTS
We evaluate the performance of our proposed method in terms of convergence speed, scalability and error rates of the learning objective function using the K-Means clustering algorithm. The motivation to choose this algorithm for evaluation is twofold: First, K-Means is probably one of the simplest machine learning algorithms known in the literature (refer to [9] for a comprehensive overview). This leaves little room for algorithmic optimization other than the choice of the numerical optimization method. Second, it is also one of the most popular unsupervised learning algorithms (the original paper [10] has been cited several thousand times), with a wide range of applications and a large practical impact.

5.1 K-Means Clustering
K-Means is an unsupervised learning algorithm which tries to find the underlying cluster structure of an unlabeled vectorized dataset. Given a set of m n-dimensional points X = {x_i}, i = 1, . . . , m, which is to be clustered into a set of k clusters w = {w_k}, k = 1, . . . , k, the K-Means algorithm finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized.
It should be noted that finding the global minimum of the squared error over all k clusters E(w) is proven to be an NP-hard problem [9]. Hence, all optimization methods investigated in this paper are only approximations of an optimal solution. However, it has been shown [12] that K-Means finds local optima which are very likely to be in close proximity to the global minimum if the assumed structure of k clusters is actually present in the given data.

Gradient Descent Optimization
Following the notation given in [3], K-Means is formalized as a minimization problem of the quantization error E(w):

E(w) = ∑_i (1/2) (x_i − w_{s_i(w)})^2,    (8)

where w = {w_k} is the target set of k prototypes for the given m examples {x_i} and s_i(w) returns the index of the closest prototype to the sample x_i. The gradient descent of the quantization error E(w) is then derived as ∆(w) = ∂E(w)/∂w. For the usage with the previously defined gradient descent algorithms, this can be reformulated into the following update functions with step size ε. Algorithms 1 and 5 use a batch update scheme, where the size m' = m for the original BATCH algorithm and m' << m for our ASGD:

∆(w_k) = (1/m') ∑_i { x_i − w_k if k = s_i(w); 0 otherwise }    (9)

The SGD (algorithm 3) uses an online update:

∆(w_k) = { x_i − w_k if k = s_i(w); 0 otherwise }    (10)
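The update functions (9) and (10) translate directly into code. Note that, taken literally, they give the descent direction (the negative of ∂E/∂w), so the sketch below adds the step; the helper name `kmeans_delta` and the mini-batch interface are illustrative assumptions.

```python
import numpy as np

def kmeans_delta(batch, w):
    """Equation (9): per-prototype mean of (x_i - w_k) over the samples whose
    closest prototype is k; prototypes that win no sample get a zero update."""
    grad = np.zeros_like(w)
    counts = np.zeros(len(w))
    for x in batch:
        k = np.argmin(np.linalg.norm(w - x, axis=1))   # s_i(w): index of closest prototype
        grad[k] += x - w[k]
        counts[k] += 1
    counts[counts == 0] = 1                            # avoid division by zero
    return grad / counts[:, None]

# One step on a mini-batch of samples; the online case of equation (10)
# corresponds to a batch of size one:
# w = w + eps * kmeans_delta(batch, w)
```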

Implementation
We applied all three previously introduced gradient descent methods to K-Means clustering: the batch optimization with MapReduce [5] (algorithm 1), the parallel SGD [20] (algorithm 3) and our proposed ASGD method (algorithm 5). We used the C++ interface of GPI 2.0 [8] and the C++11 standard library threads for local parallelization.
To assure a fair comparison, all methods share the same data IO and distribution methods, as well as an optimized MapReduce method, which uses a tree structured communication model to avoid transmission bottlenecks.

5.2 Cluster Setup
The experiments were conducted on a Linux cluster with a BeeGFS parallel file system (see www.beegfs.com for details). Each compute node is equipped with dual Intel Xeon E5-2670, totaling 16 CPUs per node and 32 GB RAM, and is interconnected with FDR Infiniband. If not noted otherwise, we used a standard setup of 64 nodes (a total of 1024 CPUs) to compute the experimental results.

5.3 Data
We use two different types of datasets for the experimental evaluation and comparison of the three investigated algorithms: a synthetically generated collection of datasets and data from an image classification application.

Synthetic Data Sets
The need to use synthetic datasets for evaluation arises from several rather profound reasons: (I) the optimal solution is usually unknown for real data, (II) only a few very large datasets are publicly available, and (III) we even need a collection of datasets with varying parameters such as dimensionality n, size m and number of clusters k in order to evaluate the scalability.
The generation of the data follows a simple heuristic: given n, m and k, we randomly sample k cluster centers and then randomly draw m samples. Each sample is randomly drawn from a distribution which is uniquely generated for the individual centers. Possible cluster overlaps are controlled by additional minimum cluster distance and cluster variance parameters. The detailed properties of the datasets are given in the context of the experiments.

Image Classification
Image classification is a common task in the field of computer vision. Roughly speaking, the goal is to automatically detect and classify the content of images into a given set of categories like persons, cars, airplanes, bikes, furniture and so on. A common approach is to extract low level image features and then to generate a "Codebook" of universal image parts, the so-called Bag of Features [15]. Objects are then described as a statistical model of these parts. The key step towards the generation of the "Codebook" is a clustering of the image feature space.
In our case, large numbers of d = 128 dimensional HOG features [19] were extracted from a collection of images and clustered to form "Codebooks" with k = 100, . . . , 1000 entries.

5.4 Evaluation
Due to the non-deterministic nature of stochastic methods and the fact that the investigated K-Means algorithms might get stuck in local minima, we apply a 10-fold evaluation of all experiments. If not noted otherwise, plots show the mean results. Since the variance is usually orders of magnitude lower than the plotted scale, we omit variance bars in the plots for the sake of readability. If needed, we report significant differences in the variance statistics separately. To simplify the notation, we will denote the SimuParallelSGD [20] algorithm by SGD, the MapReduce baseline method [5] by BATCH and our algorithm by ASGD. For better comparability, we give the number of iterations I as the global sum over all samples that have been touched by the respective algorithm. Hence, I_BATCH := T · |X|, I_SGD := T · |CPUs| and I_ASGD := T · b · |CPUs|.
Given runtimes are computed for the optimization only, neglecting the initial data transfer to the nodes, which is the same for all methods. Errors reported for the synthetic datasets are computed as follows: we use the "ground-truth" cluster centers from the data generation step and measure their distance to the centers returned by the investigated algorithms. It is obvious that this measure has no absolute value; it is only useful to compare the relative differences in the convergence of the algorithms. Also, it cannot be expected that a method will be able to reach a zero error result. This is simply because there is no absolute truth for overlapping clusters which can be obtained from the generation process without actually solving the exact NP-hard clustering problem. Hence, the "ground-truth" is most likely also biased in some way.

5.5 Experimental Results
Scaling
We evaluate the runtime and scaling properties of our proposed algorithm in a series of experiments on synthetic and real datasets (see section 5.3). First, we test a strong scaling scenario, where the size of the input data (in k, d and the number of samples) and the global number of iterations are constant for each experiment, while the number of CPUs is increased. Independent of the number of iterations and CPUs, ASGD is always the fastest method, both for synthetic data (see figures 5 and 9) and real data (figure 6). Notably, it shows (slightly) better than linear scaling properties. The SGD and BATCH methods suffer from a communication overhead which drives them well beyond linear scaling (which is projected by the dotted lines in the graphs). For SGD, this effect is dominant for smaller numbers of iterations (note that, as shown in figure 9, a smaller number of iterations is actually sufficient to solve the given problem) and softens proportionally with the increasing number of iterations. This is due to the fact that the communication cost is independent of the number of iterations.
The second experiment investigates scaling in the number of target clusters k, given constant I, d, number of CPUs and data size. Figure 7 shows that all methods scale better than O(log k). While ASGD is faster than the other methods, its scaling properties are slightly worse. This is due to the fact that the necessary sparseness of the asynchronous updates (see section 4.4) increases with k.


Figure 5: Results of a strong scaling experiment on the synthetic dataset with k=10, d=10 and ∼1TB of data samples for different numbers of iterations I. The related error rates are shown in figure 9.

Figure 6: Strong scaling on real data. Results with I = 10^10 and k = 10..1000 on the image classification dataset.

Figure 7: Scaling the number of clusters k on real data. Results for the same experiment as in figure 6. Note: here, the dotted lines project a logarithmic scaling of the runtime in the number of clusters.

Convergence Speed
Convergence (in terms of iterations and time) is an important factor in large scale machine learning, where the early termination properties of algorithms have a huge practical impact. Figure 8 shows the superior convergence properties of ASGD. While it finally converges to similar error rates, it reaches a fixed error rate with fewer iterations than SGD or BATCH. As shown in figure 8, this early convergence property can result in speedups of up to one order of magnitude.

Figure 8: Convergence speed of different gradient descent methods used to solve K-Means clustering with k = 100 and b = 500 on a 10-dimensional target space, parallelized over 1024 CPUs on a cluster (x-axis: number of gradient updates, 10^6 to 10^12; y-axis: median error per cluster center). Our novel ASGD method outperforms the communication-free SGD [20] and the MapReduce based BATCH [5] optimization by orders of magnitude.


Optimization Error after Convergence
The optimization error after full convergence (here defined as the state where the error rate is no longer improving after several iterations) for the strong scaling experiment (see section 5.5) is shown in figures 9 and 10. While ASGD outperforms BATCH, it shows no significant difference in the mean error rates compared to SGD. However, figure 10 shows that it tends to be more stable in terms of the variance of the non-deterministic K-Means results.

Figure 9: Error rates and their variance for the strong scaling experiment on synthetic data shown in figure 5. A more detailed view of the variances is shown in figure 10.

Figure 10: Variance of the error rates of the strong scaling experiment on synthetic data shown in figure 5.

Communication Frequency
Theoretically, more communication should lead to better results of the ASGD algorithm, as long as the node interconnection provides enough bandwidth. Figure 11 shows this effect: as long as the bandwidth suits the communication load, the overhead of an ASGD update is marginal compared to the SGD baseline. However, the overhead increases to over 30% when the bandwidth is exceeded.

Figure 11: Communication cost of ASGD. The cost of higher communication frequencies 1/b in ASGD updates compared to communication-free SGD updates (x-axis: 1/b from 10^-5 to 10^-2; y-axis: change in runtime in % compared to SGD).

As indicated by the data in figure 11, we chose b = 500 for all of our experiments. However, as noted in section 4.5, an optimal choice for b has to be found for each hardware configuration separately.
The number of messages exchanged during the strong scaling experiments is shown in figure 12. While the number of messages sent by each CPU stays close to constant, the number of received messages is decreasing.

Figure 12: Asynchronous communication rates during the strong scaling experiment (see figure 5). This figure shows the average number of messages sent or received by a single CPU over all iterations. "Good" messages are defined as those which were selected by the Parzen-window function and hence contribute to the local update.

Notably, the impact on the asynchronous communication remains stable, because the number of "good" messages is also close to constant. Figure 13 shows the impact of the communication frequency 1/b on the convergence properties of ASGD. If the frequency is set to lower values, the convergence moves towards the original SimuParallelSGD behavior.

Figure 13: Convergence speed of ASGD with a communication frequency of 1/10000 compared to 1/500, in relation to the other methods (SGD, BATCH). Results on synthetic data with d = 10, k = 100 (x-axis: number of gradient updates; y-axis: median error per cluster center).

Figure 14: Convergence speed of ASGD optimization (synthetic dataset, k = 10, d = 10) with and without asynchronous communication (silent).

Impact of the Asynchronous Communication
ASGD differs in two major aspects from SimuParallelSGD: asynchronous communication and mini-batch updates. In order to verify that the single-sided communication is the dominating factor of ASGD's properties, we simply turned off the communication ("silent" mode) during the optimization. Figures 14 and 15 show that our communication model is indeed driving the early convergence feature, both in terms of iterations and time needed to reach a given error level.

Figure 15: Early convergence properties of ASGD without communication (silent) compared to ASGD and SGD.

Final Aggregation
As noted in section 4.3, the local results of ASGD could be further aggregated by a final reduce step (just like in SGD). Figures 16 and 17 show a comparison of both approaches on the strong scaling experiment (see figure 5).

Figure 16: Comparison of the runtime and scalability for the two possible final aggregation methods of ASGD. Synthetic dataset with k = 10, d = 10 and ∼1TB of data samples. Error rates for this experiment are shown in figure 17.

Figure 17: Error rates for the same experiment shown in figure 16, comparing the final aggregation steps of ASGD.

6. CONCLUSIONS
We presented a novel approach towards an effective parallelization of stochastic gradient descent optimization on distributed memory systems. Our experiments show that the asynchronous communication scheme can be applied successfully to SGD optimizations of machine learning algorithms, providing superior scalability and convergence compared to previous methods.
Especially the early convergence property of ASGD should be of high practical value for many applications in large scale machine learning.

7. REFERENCES
[1] K. Bhaduri, K. Das, K. Liu, and H. Kargupta. Distributed data mining bibliography. http://www.csee.umbc.edu/~hillol/DDMBIB/.
[2] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[3] L. Bottou and Y. Bengio. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems 7 (NIPS 1994), pages 585–592, 1994.
[4] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. In Neural Information Processing Systems 20, pages 161–168. MIT Press, 2008.
[5] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[7] D. Grunewald. BQCD with GPI: A case study. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 388–394. IEEE, 2012.
[8] D. Grunewald and C. Simmendinger. The GASPI API specification and its implementation GPI 2.0. In 7th International Conference on PGAS Programming Models, volume 243, 2013.
[9] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[10] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[11] R. Machado, C. Lojewski, S. Abreu, and F.-J. Pfreundt. Unbalanced tree search on a manycore system using the GPI programming model. Computer Science - Research and Development, 26(3-4):229–236, 2011.
[12] M. Meila. The uniqueness of a good optimum for k-means. In Proceedings of the 23rd International Conference on Machine Learning, pages 625–632, 2006.
[13] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, 2011.
[14] C. Noel and S. Osindero. Dogwild!: Distributed Hogwild for CPU & GPU. 2014.
[15] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In Computer Vision - ECCV 2006, pages 490–503. Springer, 2006.
[16] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[17] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177–1178. ACM, 2010.
[18] H. Shan, B. Austin, N. J. Wright, E. Strohmaier, J. Shalf, and K. Yelick. Accelerating applications at scale using one-sided communication.
[19] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan. Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1491–1498. IEEE, 2006.
[20] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.