TRANSCRIPT

Page 1: Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University at MLconf SF - 11/13/15

Alex Smola, CMU & Marianas Labs

github.com/dmlc

Fast, Cheap and Deep: Scaling machine learning

SFW

Page 2:

This Talk in 3 Slides (the systems view)

Page 3:

Parameter Server

[Diagram: parameter server architecture. Multiple workers each read data (local or cloud), process it in parallel, and write results; each worker keeps local state (or a copy of it) and exchanges updates with the server group.]

Page 4:

Parameter Server: Multicore

[Diagram: the same read / process / write pipeline on a single multicore machine; the cores share local state (or a copy) and exchange updates with the parameter server.]

Page 5:

Parameter Server: GPUs (for Deep Learning)

[Diagram: GPUs as the processing units; each reads data (local or cloud), writes results, and synchronizes local state (or a copy) via updates to the parameter server.]

Page 6:

Details
• Parameter Server Basics
  Logistic Regression (Classification)
• Memory Subsystem
  Matrix Factorization (Recommender)
• Large Distributed State
  Factorization Machines (CTR)
• GPUs
  MXNET and applications
• Models

Page 7:

Estimate Click Through Rate

$p(\underbrace{\text{click}}_{=:y} \mid \underbrace{\text{ad, query}}_{=:x}, w)$

Page 8:

Sparse models for advertising: Click Through Rate (CTR)
• Linear function class: $f(x) = \langle w, x \rangle$
• Logistic regression: $p(y \mid x, w) = \dfrac{1}{1 + \exp(-y \langle w, x \rangle)}$
• Optimization problem: $\min_w \; -\log p(f \mid X, Y)$, i.e.
  $\min_w \; \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i \langle w, x_i \rangle)\bigr) + \lambda \|w\|_1$
• Solve distributed over many machines (typically 1 TB to 1 PB of data)

Page 9:

Optimization Algorithm
• Compute gradient on data
• The $\ell_1$ norm is nonsmooth, hence use a proximal operator:
  $\arg\min_w \; \|w\|_1 + \tfrac{\lambda}{2}\,\|w - (w_t - \eta g_t)\|^2$
• Updates for $\ell_1$ are very simple (soft thresholding):
  $w_i \leftarrow \mathrm{sgn}(w_i)\,\max(0, |w_i| - \epsilon)$
• All steps decompose by coordinates
• Solve in parallel (and asynchronously)
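
For concreteness, a minimal NumPy sketch of the per-coordinate proximal (soft-thresholding) step above; the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def proximal_step(w, grad, lr, lam):
    """One proximal-gradient step for l1-regularized logistic regression:
    take an ordinary gradient step, then soft-threshold each coordinate.
    The update decomposes by coordinates, so it can run in parallel
    (and asynchronously) across parameter-server shards."""
    z = w - lr * grad                                          # gradient step w_t - eta * g_t
    return np.sign(z) * np.maximum(0.0, np.abs(z) - lr * lam)  # epsilon = lr * lam here
```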

Page 10:

Parameter Server Template
• Compute gradient on (a subset of) data on each client
• Send gradient from client to server asynchronously: push(key_list, value_list, timestamp)
• Proximal gradient update on server, per coordinate
• Server returns parameters: pull(key_list, value_list, timestamp)

Smola & Narayanamurthy, 2010, VLDB; Gonzalez et al., 2012, WSDM; Dean et al., 2012, NIPS; Shervashidze et al., 2013, WWW
Used at Google, Baidu, Facebook, Amazon, Yahoo, Microsoft
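
A minimal sketch of the client loop this template describes, assuming a hypothetical `ps` handle whose `push`/`pull` calls follow the signatures on the slide; this is illustrative pseudocode, not the actual DMLC API.

```python
def worker_loop(ps, shard, num_iters):
    """Asynchronous worker: pull current weights, compute a local gradient
    on this worker's data shard, and push the (sparse) gradient back.
    The server applies the per-coordinate proximal update on receipt."""
    for t in range(num_iters):
        keys = shard.active_feature_ids()       # sparse feature keys touched by this shard
        weights = [0.0] * len(keys)
        ps.pull(keys, weights, timestamp=t)     # server fills `weights` in place
        grads = shard.logistic_gradient(keys, weights)
        ps.push(keys, grads, timestamp=t)       # asynchronous: no blocking on other workers
```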

Page 11:

(Excerpt from the parameter server paper, Li et al., OSDI'14, shown on the slide:)

Each key segment is then duplicated onto the k anticlockwise neighbor server nodes for fault tolerance. If k = 1, then the segment with the mark in the example will be duplicated at Server 3. A new node is first randomly (via a hash function) inserted into the ring, and then takes over the key segments from its clockwise neighbors. On the other hand, if a node is removed or if it fails, its segments will be served by its nearest anticlockwise neighbors, which already own a duplicate copy if k > 0. To recover a failed node, we simply insert a node back into the failed node's previous position and then request the segment data from its anticlockwise neighbors.

4.5 Node Join and Leave

5 Evaluation

5.1 Sparse Logistic Regression

Sparse logistic regression is a linear binary classifier which combines a logit loss with a sparse regularizer:

$\min_{w \in \mathbb{R}^p} \; \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i \langle x_i, w \rangle)\bigr) + \lambda \|w\|_1,$

where the regularizer $\|w\|_1$ has the desirable property of controlling the number of non-zero entries in the optimal solution $w^*$, but its non-smoothness makes the objective hard to solve.

We compared the parameter server with two specific-purpose systems developed by an Internet company. For privacy reasons we name them System-A and System-B, respectively. The former uses a variant of the well-known L-BFGS [21, 5], while the latter runs a variant of a block proximal gradient method [24], which updates a block of parameters at each iteration according to the first-order and diagonal second-order gradients. Both systems use a sequential consistency model, but are well optimized in both computation and communication.

We re-implemented the algorithm used by System-B on the parameter server, with two modifications. One is that we relax the consistency model to bounded delay. The other is a KKT filter to avoid sending gradients that do not affect the parameters.

Specifically, let $g_k$ be the global (first-order) gradient on feature k at iteration t. Then, due to the update rule, the corresponding parameter $w_k$ will not change at this iteration if $w_k = 0$ and $-\lambda \le g_k \le \lambda$. Therefore it is not necessary for workers to send $g_k$ at this iteration. But a worker does not know the global $g_k$ without communication; instead, we let worker i approximate $g_k$ from its local gradient $g_k^i$ by $\tilde g_k = c_k g_k^i / c_k^i$, where $c_k$ is the global number of nonzero entries on feature k and $c_k^i$ is the local count, both of which are constants that can be obtained before iterating. The worker then skips sending $g_k$ if

$w_k = 0 \quad \text{and} \quad -\lambda + \Delta \le \tilde g_k \le \lambda - \Delta,$

where $\Delta \in [0, \lambda]$ is a user-defined constant.
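
A sketch of the KKT filter check just described, using the global-gradient approximation $\tilde g_k = c_k g_k^i / c_k^i$; the names and the exact thresholding convention are as reconstructed here, so treat it as illustrative rather than the paper's code.

```python
def skip_gradient(w_k, local_grad, local_count, global_count, lam, delta):
    """Return True if worker i may skip sending the gradient for feature k:
    the weight is zero and the approximated global gradient lies inside the
    interval where the l1 proximal update would leave it at zero."""
    if w_k != 0.0 or local_count == 0:
        return False
    g_tilde = global_count * local_grad / local_count   # rescaled local gradient
    return -lam + delta <= g_tilde <= lam - delta
```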

Method                       | Consistency                | LOC
System-A (L-BFGS)            | Sequential                 | 10,000
System-B (Block PG)          | Sequential                 | 30,000
Parameter Server (Block PG)  | Bounded delay, KKT filter  | 300

Table 3: The three systems compared.

These three systems are compared in Table 3. Notably, both System-A and System-B consist of more than 10K lines of code, while the parameter server implementation uses fewer than 300.

To demonstrate the efficiency of the parameter server, we collected a computational advertisement dataset with 170 billion examples and 65 billion unique features. The raw text data size is 636 TB, and the compressed format is 141 TB. We run these systems on 1000 machines, each with 16 cores and 192 GB of memory, connected by 10 Gb Ethernet. For the parameter server, we use 800 machines to form the worker group. Each worker caches around 1 billion parameters. The remaining 200 machines form the server group, where each machine runs 10 (virtual) server nodes.

[Figure 8: Convergence results of sparse logistic regression: objective value vs. time (hours, log scale) for System-A, System-B, and Parameter Server; the goal is to achieve a small objective value in less time.]

We run these three systems to reach the same objective value; the less time used, the better. Both System-B and the parameter server use 500 blocks. In addition, the parameter server fixes τ = 4 for the bounded delay, which means each worker can execute 4 blocks in parallel.

Solving it at scale
• 2014 - Li et al., OSDI'14
  • 500 TB data, 10^11 variables
  • Local file system stores files
  • 1000 servers (corp cloud)
  • 1 h time, 140 MB/s learning
• 2015 - Online solver
  • 1.1 TB (Criteo), 8·10^8 variables, 4·10^9 samples
  • S3 stores files (no preprocessing), better IO library
  • 5 machines (c4.8xlarge)
  • 1000 s time, 220 MB/s learning

Page 12:

Details
• Parameter Server Basics
  Logistic Regression (Classification)
• Memory Subsystem
  Matrix Factorization (Recommender)
• Large Distributed State
  Factorization Machines (CTR)
• GPUs
  MXNET and applications
• Models

Page 13:

Recommender Systems
• Users u, movies m (or projects)
• Function class: $r_{um} = \langle v_u, w_m \rangle + b_u + b_m$
• Loss function for recommendation (Yelp, Netflix):
  $\sum_{u \sim m} \bigl(\langle v_u, w_m \rangle + b_u + b_m - y_{um}\bigr)^2$

Page 14:

Recommender Systems
• Regularized objective:
  $\sum_{u \sim m} \bigl(\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um}\bigr)^2 + \frac{\lambda}{2}\bigl[\|U\|_{\mathrm{Frob}}^2 + \|V\|_{\mathrm{Frob}}^2\bigr]$
• Update operations:
  $v_u \leftarrow (1 - \eta_t \lambda)\, v_u - \eta_t w_m \bigl(\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um}\bigr)$
  $w_m \leftarrow (1 - \eta_t \lambda)\, w_m - \eta_t v_u \bigl(\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um}\bigr)$
• Very simple SGD algorithm (random pairs); see the sketch below
• This should be cheap … (→ memory subsystem)
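
A minimal NumPy sketch of the SGD update above over randomly sampled (user, movie, rating) triples; the bias updates are an assumption (the slide only shows the factor updates) and all names are illustrative.

```python
import numpy as np

def sgd_epoch(ratings, V, W, b_u, b_m, b0, lr, lam):
    """One SGD pass for the regularized matrix-factorization objective.
    ratings: array of (user, movie, rating) rows; V/W: user and movie
    factor matrices; b_u/b_m/b0: biases."""
    for row in np.random.permutation(len(ratings)):
        u, m, r = int(ratings[row, 0]), int(ratings[row, 1]), ratings[row, 2]
        err = V[u] @ W[m] + b_u[u] + b_m[m] + b0 - r          # prediction error
        V[u], W[m] = ((1 - lr * lam) * V[u] - lr * err * W[m],
                      (1 - lr * lam) * W[m] - lr * err * V[u])
        b_u[u] -= lr * err                                    # assumed bias updates
        b_m[m] -= lr * err
```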

Page 15:

This should be cheap …
• O(md) burst reads and O(m) random reads
• Netflix dataset: m = 100 million ratings, d = 2048 dimensions, 30 steps
• Runtime should be > 4500 s:
  • 60 GB/s memory bandwidth = 3300 s
  • 100 ns random reads = 1200 s
• We get 560 s. Why?

Liu, Wang, Smola, RecSys 2015
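
The slide does not show the arithmetic behind these bounds; one accounting that reproduces them, assuming 8-byte (double-precision) factor entries and roughly four memory touches per rating per step, is:

```latex
% 30 steps x 10^8 ratings = 3 x 10^9 rating updates in total.
\begin{align*}
\text{bandwidth bound:} &\quad
  \frac{3\cdot 10^{9} \times 4 \times 2048 \times 8\,\mathrm{B}}{60\,\mathrm{GB/s}}
  \approx 3300\,\mathrm{s},\\
\text{latency bound:} &\quad
  3\cdot 10^{9} \times 4 \times 100\,\mathrm{ns} = 1200\,\mathrm{s}.
\end{align*}
```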

Page 16:

Power law in Collaborative Filtering

[Plot: Netflix dataset; number of movies (log scale) vs. number of ratings per movie (log scale), showing a power-law distribution.]

Page 17:

Key Ideas
• Stratify ratings by users (only 1 cache miss / read per user / out of core)
• Keep frequent movies in cache (stratify by blocks of movie popularity)
• Avoid false sharing between sockets (a key cached on the wrong CPU causes a miss)

Page 18:

Key Ideas

GraphChi Partitioning

Page 19:

Key Ideas

SC-SGD partitioning

Page 20:

Speed (c4.8xlarge)

[Plots: seconds vs. number of dimensions (0 to 2000). Left: Netflix, 100M ratings, 15 iterations, comparing C-SGD, Fast SGLD, GraphChi, GraphLab, and BIDMach. Right: Yahoo, 250M ratings, 30 iterations, comparing C-SGD, Fast SGLD, GraphChi, and GraphLab. The slide also notes g2.8xlarge.]

Page 21:

Convergence
• GraphChi blocks (users, movies) into random groups
• Poor mixing
• Slow convergence

[Plot: test RMSE vs. runtime in seconds (30 epochs total) for C-SGD, GraphChi SGD, and Fast SGLD at k = 2048.]

Page 22:

Details
• Parameter Server Basics
  Logistic Regression (Classification)
• Memory Subsystem
  Matrix Factorization (Recommender)
• Large Distributed State
  Factorization Machines (CTR)
• GPUs
  MXNET and applications
• Models

Page 23:

A Linear Model is not enough

p(y|x,w)

Page 24:

Factorization Machines
• Linear model: $f(x) = \langle w, x \rangle$
• Polynomial expansion (Rendle, 2012):
  $f(x) = \langle w, x \rangle + \sum_{i<j} x_i x_j \,\mathrm{tr}\bigl(V_i^{(2)} \otimes V_j^{(2)}\bigr) + \sum_{i<j<k} x_i x_j x_k \,\mathrm{tr}\bigl(V_i^{(3)} \otimes V_j^{(3)} \otimes V_k^{(3)}\bigr) + \ldots$
  (memory hog; too large for an individual machine)
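
A minimal NumPy sketch of the second-order term in this expansion, where $\mathrm{tr}(V_i^{(2)} \otimes V_j^{(2)})$ reduces to $\langle V_i, V_j \rangle$ and Rendle's identity evaluates the pairwise sum in time linear in the number of features; names are illustrative.

```python
import numpy as np

def fm_predict(x, w, V):
    """Second-order factorization machine prediction
    f(x) = <w, x> + sum_{i<j} x_i x_j <V_i, V_j>,
    computed via 0.5 * (||V^T x||^2 - sum_i x_i^2 ||V_i||^2)."""
    vx = V.T @ x                                             # k-dimensional projection
    pairwise = 0.5 * (vx @ vx - ((x ** 2) @ (V ** 2)).sum())
    return w @ x + pairwise
```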

Page 25:

Prefetching to the rescue
• Most keys are infrequent (power law distribution)
• Prefetch the embedding vectors for a minibatch from the parameter server
• Compute gradients and push to server
• Variable dimensionality embedding
• Enforcing sparsity (ANOVA style)
• Adaptive gradient normalization
• Frequency adaptive regularization (CF style)

(Excerpt from the DiFacto paper, Li, Wang, Liu, Smola, WSDM'16, shown on the slide:)

… feature occurs in a minibatch. The consequence of this is that parameters associated with frequent features are less overregularized. The penalty can be written as

$\Omega[w, V] = \frac{1}{2} \sum_i \bigl[\lambda_i w_i^2 + \mu_i \|V_i\|_2^2\bigr] \quad \text{with } \lambda_i \propto \mu_i. \qquad (11)$

Lemma 2 (Dynamic Regularization). Assume that we solve a factorization machines problem with the following updates in a minibatch setting: for each feature i,

$I_i \leftarrow I\{[x_j]_i \ne 0 \text{ for some } (x_j, y_j) \in B\} \qquad (12)$

$w_i \leftarrow w_i - \frac{\eta_t}{b} \sum_{(x_j, y_j) \in B} \partial_{w_i}\, l(f(x_j), y_j) - \eta_t \lambda w_i I_i \qquad (13)$

$V_i \leftarrow V_i - \frac{\eta_t}{b} \sum_{(x_j, y_j) \in B} \partial_{V_i}\, l(f(x_j), y_j) - \eta_t \mu V_i I_i \qquad (14)$

Here B denotes the minibatch, b its size, and $I_i$ denotes whether or not feature i appears in this minibatch. Then the effective regularization is given by

$\lambda_i = \lambda \rho_i \quad \text{and} \quad \mu_i = \mu \rho_i \quad \text{where } \rho_i = 1 - (1 - n_i/m)^b \approx \frac{n_i b}{m}.$

Proof. The probability that a particular feature occurs in a random minibatch is $\rho_i = 1 - (1 - n_i/m)^b$. Observe that while the amounts of regularization on the minibatches are not independent, they are additive (and exchangeable). Hence the expected amounts of regularization are $\rho_i \lambda$ and $\rho_i \mu$, respectively. Expanding the Taylor series in $\rho_i$ yields the approximation.

Note that for b = 1 we obtain the conventional frequency-dependent regularization, whereas in the batch setting b = m we obtain the Frobenius regularization. That is, choosing the minibatch size conveniently allows us to interpolate between both extremes efficiently.

3.4 Putting All Things Together

We obtain the following model and optimization problem:

$\text{minimize}_{w, V} \quad \frac{1}{|X|} \sum_{(x, y)} l(f(x), y) + \lambda_1 \|w\|_1 + \frac{1}{2} \sum_i \bigl[\lambda_i w_i^2 + \mu_i \|V_i\|_2^2\bigr] \qquad (15)$

$\text{subject to} \quad V_i = 0 \text{ if } w_i = 0 \quad \text{and} \quad V_{ij} = 0 \text{ for } j > k_i,$

where the choice of the constants $\lambda_1$, $\lambda_i$, $\mu_i$ and $k_i$ is as discussed above. This model differs in two key parts from standard FM models: we add frequency adaptive regularization to better model the possibly nonuniform sparsity pattern in the data, and we use sparsity regularization and memory adaptive constraints to control the model size, which benefits both the statistical model and system performance.

4. DISTRIBUTED OPTIMIZATION

Minimizing (15) is challenging since the potentially large model brings a large communication cost. We focus on asynchronous optimization methods, which hide synchronization cost by communicating and computing in parallel. It has been shown that asynchronous block subspace descent can effectively solve nonconvex objective functions with the nonsmooth $\ell_1$ regularizer [8]. However, it requires expensive data preprocessing. In this paper, we consider asynchronous stochastic gradient descent (SGD) [8] for optimization.

[Figure 1: An illustration of asynchronous SGD with three server nodes and three worker nodes. Workers W1 and W2 first pull weights from the servers at time 1. Next, W1 pushes its gradient back and thereby increases the timestamp at the servers. Worker W3 pulls the weights immediately after that. Then W2 and W3 push their gradients back. The delays for W1, W2, and W3 are 0, 1, and 1.]

4.1 Asynchronous Stochastic Gradient Descent

For simplicity we assume there is a single (virtual) server node, which maintains the model w and V (the parameter server framework provides a clean abstraction for multiple servers). Multiple workers run SGD independently of each other. Each worker repeatedly reads data, pulls the recent model from the server, computes a gradient, and then pushes the gradient back. Due to network delay, the gradient the server receives and uses to update the model may not have been computed on the most recent model. An example is illustrated in Figure 1.

A worker can calculate the gradient of the loss in the standard way [14]. We first compute the partial gradients of f:

$\partial_{w_i} f(x, w, V) = x_i \qquad (16)$

$\partial_{V_{ij}} f(x, w, V) = x_i [V x]_j - x_i^2 V_{ij}. \qquad (17)$

Note that the term $Vx$ can be pre-computed. Invoking the chain rule yields the gradient of l. Denote by w(t) and V(t) the model stored at the server at time t. Assume that at time t the server received a gradient pushed from one worker,

$g_\theta(t) \leftarrow \partial_\theta\, l\bigl(f(x, w(t - \tau), V(t - \tau)), y\bigr) \qquad (18)$

where $\theta$ can be either w or V, and $\tau$ is the delay, indicating that the gradient was computed using the model at time $t - \tau$.

As mentioned in Section 3.3, we use frequency adaptive regularization for w and V and the sparsity-inducing $\ell_1$ norm for w. In addition, we use AdaGrad [4] to better model the possibly nonuniform sparsity in the data. In other words, with $\mu$ the scalar in (14) and scalars $\eta_V$ and $\lambda_V$, the server updates $V_{ij}$ by

$n_{ij} \leftarrow n_{ij} + \bigl[g_{V_{ij}}(t)\bigr]^2$

$V_{ij}(t+1) \leftarrow V_{ij}(t) - \frac{\eta_V}{\lambda_V + \sqrt{n_{ij}}} \bigl(g_{V_{ij}}(t) + \mu V_{ij}(t)\bigr), \qquad (19)$

where $n_{ij}$ is initialized to 0. Updating w is slightly different from V due to the nonsmooth $\ell_1$ regularizer. We adopted FTRL [10], which solves a "smoothed" proximal operator based on AdaGrad. Similarly denote by $\eta_w$ and $\lambda_w$ the …
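
A minimal sketch of the server-side AdaGrad update (19) for a single embedding entry, with the frequency-adaptive $\ell_2$ term folded into the step; the FTRL update for w is omitted and all names are illustrative.

```python
import math

def adagrad_update_V(V_ij, n_ij, grad, mu, eta_V, lam_V):
    """Eq. (19): accumulate the squared gradient, then take a
    coordinate-wise scaled step on (gradient + mu * V_ij)."""
    n_ij += grad ** 2
    V_ij -= eta_V / (lam_V + math.sqrt(n_ij)) * (grad + mu * V_ij)
    return V_ij, n_ij
```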

Page 26:

Better Models

(Excerpt, continued:)

[Figure 2: Using different adaptive memory constraints when varying the embedding dimension k (log scale). Left column: Criteo2; right column: CTR2. Panels: (a) number of non-zero entries in V (model size); (b) runtime for one iteration (sec); (c) relative test logloss compared to logistic regression (k = 0 and 0 relative loss). Curves: no memory adaptation, frequency, frequency + l1 shrinkage.]

A more interesting observation is that these memory adaptive constraints do not affect the test accuracy. On the contrary, we even see a slight improvement when the dimension k is greater than 8 for CTR2. The reason could be that model capacity control is of great importance when the dimension k is large, and these memory adaptive constraints provide additional capacity control beyond the $\ell_2$ and $\ell_1$ regularizers.

5.3 Fixed-point Compression

We evaluate lossy fixed-point compression for data communication. By default, both the model and gradient entries are represented as 32-bit floats. In this experiment, we compress these values to lower-precision integers. More specifically, given a bin size b and a number of bits n, we represent x by the following n-bit integer:

$z := \left\lfloor \frac{x}{b} \times 2^n \right\rfloor + \delta, \qquad (23)$

where $\delta$ is a Bernoulli random variable chosen so as to ensure that $\mathbb{E}[z] = 2^n \frac{x}{b}$.
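
A minimal NumPy sketch of the stochastic fixed-point quantizer in (23), choosing the bin size b as the absolute maximum of the values in the batch (as the excerpt on the next slide notes); the function names are illustrative.

```python
import numpy as np

def quantize(x, n_bits):
    """Stochastic fixed-point compression, Eq. (23): scale by the bin size b,
    floor to an n-bit integer, and round up with probability equal to the
    fractional part so that E[z] = 2^n * x / b (unbiased)."""
    b = float(np.max(np.abs(x))) or 1.0             # guard against an all-zero batch
    scaled = x / b * (2 ** n_bits)
    z = np.floor(scaled)
    z += np.random.random(x.shape) < (scaled - z)   # Bernoulli correction delta
    return z.astype(np.int64), b

def dequantize(z, b, n_bits):
    return z.astype(np.float64) * b / (2 ** n_bits)
```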

Slide annotations: (Criteo 1TB); k; "what everyone else does".

Page 27:

Faster Solver (small Criteo)

[Figure 3: Compressing model and gradient using fixed-point compression, where 4-byte means the default 32-bit floating-point format; curves for Criteo2 and CTR2. (a) Total data (GB) sent by the workers in one iteration vs. bytes per entry; the compression rates from 4-byte to 1-byte are 4.2x and 2.9x for Criteo2 and CTR2, respectively. (b) Relative test logloss (%) compared to no fixed-point compression.]

We implemented the fixed-point compression as a user-defined filter in the parameter server framework. Since multiple numbers are communicated in each round, we choose b to be the absolute maximum value of these numbers. In addition, we used the key-caching and lossless data compression (via LZ4) filters.

The results for n = 8, 16, 24 are shown in Figure 3. As expected, fixed-point compression linearly reduces the network traffic volume, since the traffic is dominated by communicating the model and gradient. A more interesting observation is that we obtained a 4.2x compression rate when going from 32-bit floating point to 8-bit fixed point on Criteo2. The reason is that the latter improves the compression rate of the subsequent lossless LZ4 compression.

We observed different effects on accuracy for the two datasets: CTR2 is robust to the numerical precision, while Criteo2 shows a 6% increase in logloss when only a 1-byte representation is used. However, a medium compression rate even improves the model accuracy. This might be because the lossy compression acts as a regularizer on the objective function.

5.4 Comparison with LibFM

[Figure 4: Comparison with LibFM on a single machine: test logloss vs. time (sec, log scale) for (a) Criteo1 and (b) CTR1, comparing LibFM, DiFacto with 1 worker, and DiFacto with 10 workers. The data preprocessing time for LibFM is omitted.]

To the best of our knowledge, there is no publicly released distributed FM solver. Hence we only compare DiFacto to the popular single-machine package LibFM developed by Rendle [14]. We only report results on Criteo1 and CTR1 on a single machine, since LibFM fails on the two larger datasets. We perform a similar grid search over the hyperparameters as we did for DiFacto. As LibFM only uses a single thread, we run DiFacto with 1 worker and 1 server in sequential execution order. We also report the performance using 10 workers and 10 servers on a single machine for reference.

The results are shown in Figure 4. As can be seen, DiFacto converges significantly faster than LibFM; it uses 2 times fewer iterations to reach the best model. This is because the adaptive learning rate used in DiFacto better models the data sparsity, and the adaptive regularization and constraints further accelerate convergence. In particular, the latter results in a lower test logloss on the CTR1 dataset, where the number of features exceeds the number of examples, requiring improved capacity control.

Also note that DiFacto with a single worker is twice as slow as LibFM per iteration. This is because the data communication overhead between the worker and the server cannot be ignored in sequential execution.

LibFM died on large models

Page 28:

Multiple Machines (Li, Wang, Liu, Smola, WSDM'16)

[Figure 5: The speedup from 1 machine to 16 machines for Criteo2 and CTR2, where each machine runs 10 workers and 10 servers. The difference in test logloss is within 0.5% and is omitted from the figure.]

More importantly, DiFacto does not require any data preprocessing to map arbitrary 64-bit integer and string feature indices, which are used in both Criteo1 and CTR2, to continuous integer indices. The cost of this data preprocessing step, required by LibFM but not shown in Figure 4, even exceeds the cost of training (1,400 seconds for Criteo1). Nevertheless, DiFacto with a single worker still outperforms LibFM thanks to its faster convergence. In addition, it is 10 times faster than LibFM when using 10 workers on the same machine.

5.5 Scalability

Finally, we study the scalability of DiFacto by varying the number of physical machines used in training. We run 10 workers and 10 servers on each machine, and increase the number of machines from 1 to 16. Both the time per iteration and the logloss on the test dataset are recorded and shown in Figure 5.

We observe an 8x speedup from 1 machine to 16 machines on both Criteo2 and CTR2. The reason for the satisfactory performance is twofold. First, asynchronous SGD eliminates the need for synchronization between workers and is tolerant to stragglers. Even though we used dedicated machines for the job, they still share network bandwidth with others. In particular, we observed a large variation in read speed when streaming data from Amazon's S3 service, despite using the IO-optimized c4.8xlarge series of machines. Second, DiFacto uses several filters which effectively reduce the amount of network traffic. Even though CTR2 produces 10 times more network traffic than Criteo2, the two show similar speedups.

There is a longstanding suspicion that the convergence of asynchronous SGD slows down as the number of workers increases. Nonetheless, we did not observe a substantial difference in model accuracy: the relative difference of the test logloss is below 0.5% when increasing the number of workers from 10 to 160. The reason might be that the datasets we used are highly sparse and the features are not strongly correlated, so the inconsistency due to concurrent updates by multiple workers may not have a major effect.

6. CONCLUSION

In this paper we presented DiFacto, a high-performance distributed factorization machine. DiFacto uses a refined factorization machine model with adaptive memory constraints and frequency adaptive regularization, which perform fine-grained control of model capacity based on both data and model statistics. DiFacto fits the model with asynchronous stochastic gradient descent. We analyzed its theoretical convergence and implemented it in the parameter server framework. We evaluated DiFacto thoroughly on two real computational advertising datasets with up to billions of examples and features. We showed that DiFacto produces models that are hundreds of times more compact, yet retain similar generalization accuracy. We also demonstrated that DiFacto converges faster than the state-of-the-art package LibFM. Furthermore, DiFacto is able to solve terabyte-scale problems on modest machines within a few hours.

References

[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In World Wide Web Conference, Rio de Janeiro, 2013.

[2] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75-79, 2007. URL http://doi.acm.org/10.1145/1345448.1345465.

[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.

[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2010.

[5] Criteo Labs. Criteo terabyte click logs, 2014. http://labs.criteo.com/downloads/download-terabyte-click-logs.

[6] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning, 2013.

[7] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B. Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.

[8] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014.

[9] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. arXiv preprint arXiv:1506.08272, 2015.

[10] B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In International Conference on Artificial Intelligence and Statistics, pages 525-533, 2011.

[11] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, Todd …

80x LibFM speed on 16 machines

Page 29:

Details
• Parameter Server Basics
  Logistic Regression (Classification)
• Memory Subsystem
  Matrix Factorization (Recommender)
• Large Distributed State
  Factorization Machines (CTR)
• GPUs
  MXNET and applications
• Data & Models

Page 30:

MXNET - Distributed Deep Learning
• Several deep learning frameworks
  • Torch7: MATLAB-like tensor computation in Lua
  • Theano: symbolic programming in Python
  • Caffe: configuration in Protobuf
• Problem: need to learn another system for a different programming flavor
• MXNET
  • provides programming interfaces in a consistent way
  • multiple languages: C++/Python/R/Julia/…
  • fast, memory-efficient, and distributed training

Page 31:

Programming Interfaces
• Tensor algebra interface
  • compatible (e.g. NumPy)
  • GPU/CPU/ARM (mobile)
• Symbolic expression
  • easy deep network design
  • automatic differentiation
• Mixed programming
  • CPU/GPU
  • symbolic + local code

[Code examples shown on the slide in Python and Julia.]
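
A small illustration of the two interfaces, sketched against the Python bindings of that era (imperative NDArray plus symbolic graph construction, mirroring the AlexNet example on the next slide); treat the exact output shown in comments as an assumption.

```python
import mxnet as mx

# Imperative tensor algebra, NumPy-like; swap mx.cpu() for mx.gpu(0) to run on a GPU.
a = mx.nd.ones((2, 3), ctx=mx.cpu())
b = a * 2 + 1                      # executed by the backend engine
print(b.asnumpy())                 # copy the result back to NumPy

# Symbolic expression: declare the network once, gradients come for free.
data = mx.symbol.Variable('data')
fc = mx.symbol.FullyConnected(data=data, num_hidden=10)
net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')
print(net.list_arguments())        # e.g. ['data', 'fullyconnected0_weight', ...]
```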

Page 32:

from data import ilsvrc12_iterator
import mxnet as mx

## define alexnet
input_data = mx.symbol.Variable(name="data")
# stage 1
conv1 = mx.symbol.Convolution(data=input_data, kernel=(11, 11), stride=(4, 4), num_filter=96)
relu1 = mx.symbol.Activation(data=conv1, act_type="relu")
pool1 = mx.symbol.Pooling(data=relu1, pool_type="max", kernel=(3, 3), stride=(2, 2))
lrn1 = mx.symbol.LRN(data=pool1, alpha=0.0001, beta=0.75, knorm=1, nsize=5)
# stage 2
conv2 = mx.symbol.Convolution(data=lrn1, kernel=(5, 5), pad=(2, 2), num_filter=256)
relu2 = mx.symbol.Activation(data=conv2, act_type="relu")
pool2 = mx.symbol.Pooling(data=relu2, kernel=(3, 3), stride=(2, 2), pool_type="max")
lrn2 = mx.symbol.LRN(data=pool2, alpha=0.0001, beta=0.75, knorm=1, nsize=5)
# stage 3
conv3 = mx.symbol.Convolution(data=lrn2, kernel=(3, 3), pad=(1, 1), num_filter=384)
relu3 = mx.symbol.Activation(data=conv3, act_type="relu")
conv4 = mx.symbol.Convolution(data=relu3, kernel=(3, 3), pad=(1, 1), num_filter=384)
relu4 = mx.symbol.Activation(data=conv4, act_type="relu")
conv5 = mx.symbol.Convolution(data=relu4, kernel=(3, 3), pad=(1, 1), num_filter=256)
relu5 = mx.symbol.Activation(data=conv5, act_type="relu")
pool3 = mx.symbol.Pooling(data=relu5, kernel=(3, 3), stride=(2, 2), pool_type="max")
# stage 4
flatten = mx.symbol.Flatten(data=pool3)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=4096)
relu6 = mx.symbol.Activation(data=fc1, act_type="relu")
dropout1 = mx.symbol.Dropout(data=relu6, p=0.5)
# stage 5
fc2 = mx.symbol.FullyConnected(data=dropout1, num_hidden=4096)
relu7 = mx.symbol.Activation(data=fc2, act_type="relu")
dropout2 = mx.symbol.Dropout(data=relu7, p=0.5)
# stage 6
fc3 = mx.symbol.FullyConnected(data=dropout2, num_hidden=1000)
softmax = mx.symbol.SoftmaxOutput(data=fc3, name='softmax')

## data
batch_size = 256
train, val = ilsvrc12_iterator(batch_size=batch_size, input_shape=(3, 224, 224))

## train
num_gpus = 4
gpus = [mx.gpu(i) for i in range(num_gpus)]
model = mx.model.FeedForward(
    ctx=gpus,
    symbol=softmax,
    num_round=20,
    learning_rate=0.01,
    momentum=0.9,
    wd=0.00001)
model.fit(X=train, eval_data=val,
          batch_end_callback=mx.callback.Speedometer(batch_size=batch_size))

Slide callouts: "6 layers definition"; "multi GPU"; "train it".

Page 33:

Engine
• All operations are issued into the backend C++ engine
  (binding languages' performance does not matter)
• Engine builds the read/write dependency graph
  • lazy evaluation
  • parallel execution
  • sophisticated memory allocation
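
To make the lazy, dependency-driven execution concrete, a small sketch assuming the Python NDArray bindings of the time:

```python
import mxnet as mx

# Operations return immediately; the C++ engine schedules them once their
# read/write dependencies are satisfied, possibly in parallel.
a = mx.nd.ones((1000, 1000))
b = mx.nd.ones((1000, 1000))
c = mx.nd.dot(a, b)           # enqueued, depends on a and b
d = c + 1                     # enqueued, depends on c
mx.nd.waitall()               # block until everything issued so far has finished
print(d.asnumpy()[0, 0])      # copying to NumPy also forces synchronization
```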

Page 34:

Distributed Deep Learning

Page 35:

Distributed Deep Learning

Plays well with S3, spot instances & containers

Page 36:

Results
• ImageNet dataset (1M images, 1K classes)
• Google Inception network (88 lines of code)

[Plots: images/sec vs. number of GPUs (1, 2, 4) for CXXNet and MXNet, 40% faster due to parallelization; GPU memory usage (GB) for naive, inplace, sharing, and both, a 50% GPU memory reduction.]

Page 37:

Distributed Results

[Plots: per-machine bandwidth (Mbps, 0 to 1200) and speedup (0 to 8x) for machines 1 through 10, with the g2.2xlarge network limit marked; accuracy vs. time (hours, log scale) for a single GPU vs. 4 x dual GPUs. Convergence on 4 x dual GTX 980.]

Page 38:

Amazon g2.8xlarge
• 12 instances (48 GPUs) @ $0.50/h spot
• Minibatch size 512
• BSP with 1 delay between machines
• 2 GB/s bandwidth between machines (awful)
  10.113.170.187, 10.157.109.227, 10.169.170.55, 10.136.52.151, 10.45.64.250, 10.166.137.100, 10.97.167.10, 10.97.187.157, 10.61.128.107, 10.171.105.160, 10.203.143.220, 10.45.71.20 (all over the place in the availability zone)
• Compressing to 1 byte per coordinate helps a bit but adds latency due to an extra pass (need to fix)
• 37x speedup on 48 GPUs
• ImageNet'12 dataset trained in 4 h, i.e. $24 (with AlexNet; GoogLeNet even better for the network)

Page 39:

Mobile, too
• 10 hours on 10 GTX 980
• Deployed on Android

Page 40:

Plans
• This month (went live 2 months ago)
  • 700 stars
  • 27 authors with >1k commits
  • Ranks 5th on GitHub C++ project trending
• Build a larger community
• Many more applications & platforms
  • Image/video analysis, voice recognition
  • Language models, machine translation
  • Intel drivers

Page 41:

Side-effects of using MXNET

Zichao was using Theano

Page 42:

Tensorflow vs. MXNET

            | Languages             | MultiGPU | Distributed | Mobile | Runtime Engine
Tensorflow  | Python                | Yes      | No          | Yes    | ?
MXNET       | Python, R, Julia, Go  | Yes      | Yes         | Yes    | Yes

(Googlenet) E5-1650/980:

        | Tensorflow          | Torch7 | Caffe  | MXNET
Time    | 940 ms              | 172 ms | 170 ms | 180 ms
Memory  | all (OOM after 24)  | 2.1 GB | 2.2 GB | 1.6 GB

Page 43:

More numbers (this morning)

            | Googlenet, batch 32 (24)           | Alexnet, batch 128 | VGG, batch 32 (24)
Torch7      | 172 ms / 2.1 GB                    | 135 ms / 1.6 GB    | 459 ms / 3.4 GB
Caffe       | 170 ms / 2.2 GB                    | 165 ms / 1.34 GB   | 448 ms / 3.4 GB
MXNET       | 172 ms (new and improved) / 1.5 GB | 147 ms / 1.38 GB   | 456 ms / 3.6 GB
Tensorflow  | 940 ms / -                         | 433 ms / -         | 980 ms / -

Page 44:

Details
• Parameter Server Basics
  Logistic Regression (Classification)
• Memory Subsystem
  Matrix Factorization (Recommender)
• Large Distributed State
  Factorization Machines (CTR)
• GPUs
  MXNET and applications
• Models

Page 45:

Models
• Fully connected networks (generic data)
  (70s and 80s … trivial to scale, see e.g. Caffe)
• Convolutional networks (1D, 2D grids)
  (90s - images, 2014 - text … see e.g. cuDNN)
• Recurrent networks (time series, text)
  (90s - LSTM - latent variable autoregressive)
• Memory
  • Attention model (Cho et al., 2014)
  • Memory model (Weston et al., 2014)
  • Memory rewriting - NTM (Graves et al., 2014)

Page 46:

Page 47:

Memory Networks (Sukhbaatar, Szlam, Fergus, Weston, 2015)

[Diagram: attention layer, state layer, query.]

Page 48:

Memory Networks (Sukhbaatar, Szlam, Fergus, Weston, 2015)
• Query weirdly defined (via bilinear embedding matrix)
• Works on facts / text only (ingest a sequence and weigh it)
• 'Ask me anything' by Socher et al., 2015 (uses GRUs rather than LSTMs)

Page 49:

Memory networks for images

http://arxiv.org/abs/1511.02274 (Yang et al, 2015)

Page 50:

Results
Best results on
• DAQUAR
• DAQUAR-REDUCED
• COCO-QA

Page 51:

Page 52:

Page 53:

Summary
• Parameter Server Basics
  Logistic Regression
• Large Distributed State
  Factorization Machines
• Memory Subsystem
  Matrix Factorization
• GPUs
  MXNET and applications
• Models

[Diagram: the parameter server architecture from page 3; workers read data (local or cloud), process it in parallel, write results, and synchronize local state (or a copy) with the server group.]

github.com/dmlc

Page 54:

Many thanks to
• CMU & Marianas: Mu Li, Ziqi Liu, Yu-Xiang Wang, Zichao Zhang
• Microsoft: Xiaodong He, Jianfeng Gao, Li Deng
• MXNET: Tianqi Chen (UW), Bing Xu, Naiyang Wang, Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang, Chuntao Hong