
Page 1: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Fabio Petroni and Leonardo Querzoni

Cyber Intelligence and Information Security (CIS Sapienza)

Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome

petroni|[email protected]

Foster City, Silicon Valley, 6th-10th October 2014

Page 2: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Matrix Completion & SGD

[Figure: matrix completion setup, R ≈ P × Q; P holds the user vectors, Q the item vectors, and "?" marks the missing ratings to predict.]

$\mathrm{LOSS}(P,Q) = \sum_{(i,j)} \left(R_{ij} - P_i Q_j\right)^2 + \dots$

- Stochastic Gradient Descent works by taking steps proportional to the negative of the gradient of the LOSS.

- Stochastic = P and Q are updated for each single training case by a small step that, on average, follows the full gradient descent direction (see the sketch below).
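To make the update rule concrete, here is a minimal sketch of one SGD pass over the observed ratings. It is not the authors' implementation; the function name sgd_epoch, the learning rate lr and the regularization weight reg are illustrative choices.

import numpy as np

def sgd_epoch(ratings, P, Q, lr=0.01, reg=0.05):
    # One SGD pass: for each observed rating (u, i, r), move the user
    # vector P[u] and the item vector Q[i] a small step against the
    # gradient of the squared error plus an L2 penalty.
    for u, i, r in ratings:
        p_u, q_i = P[u].copy(), Q[i].copy()
        err = r - p_u @ q_i                      # prediction error for this rating
        P[u] += lr * (err * q_i - reg * p_u)     # step along the negative gradient
        Q[i] += lr * (err * p_u - reg * q_i)

# toy usage: 3 users, 4 items, rank-2 factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(3, 2))
Q = rng.normal(scale=0.1, size=(4, 2))
sgd_epoch([(0, 1, 4.0), (2, 3, 1.0), (1, 0, 3.5)], P, Q)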


Page 3: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Scalability

✗ Lengthy training stages;

✗ high computational costs;

✗ especially on large data sets;

✗ input data may not fit in main memory.

- Goal: efficiently exploit computer cluster architectures.

...


Page 4: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Distributed Asynchronous SGD

[Figure: the rating matrix R is split into blocks R1-R4, one per computing node (1-4); each node k holds local copies Pk and Qk of the factor vectors it needs.]

- R is split;

- vectors are replicated;

- replicas are concurrently updated;

- replicas deviate inconsistently;

- synchronization is required.


Page 5: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Bulk Synchronous Processing Model

[Figure: computing nodes proceed in supersteps.]

1. local computation;

2. communication;

3. barrier synchronization (see the sketch below).
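A minimal simulation of one bulk-synchronous ASGD epoch, reusing the sgd_epoch sketch from the earlier slide: each node runs SGD on its own block of ratings against local replicas of P and Q, and the replicas are reconciled at the barrier. Averaging all replicas, and the parameter f (the number of barriers per epoch, introduced later in the deck as the synchronization frequency), are illustrative simplifications, not the paper's exact protocol, which only exchanges vectors that are actually shared.

import numpy as np

def bulk_sync_epoch(blocks, P, Q, f=1, lr=0.01, reg=0.05):
    # blocks : list of rating lists (u, i, r), one per computing node
    # f      : number of barrier synchronizations per epoch
    local = [(P.copy(), Q.copy()) for _ in blocks]   # per-node replicas
    for step in range(f):
        # 1. local computation: each node processes the next 1/f of its block
        for (Pn, Qn), ratings in zip(local, blocks):
            lo = len(ratings) * step // f
            hi = len(ratings) * (step + 1) // f
            sgd_epoch(ratings[lo:hi], Pn, Qn, lr, reg)
        # 2. communication + 3. barrier: reconcile replicas (here: plain averaging)
        P = np.mean([Pn for Pn, _ in local], axis=0)
        Q = np.mean([Qn for _, Qn in local], axis=0)
        local = [(P.copy(), Q.copy()) for _ in blocks]
    return P, Q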


Page 6: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Challenges 1/2

[Figure: R split into blocks R1-R4 across 4 computing nodes.]

1. Load balance: ensure that computing nodes are fed with the same load.

2. Minimize communication: minimize vector replicas.


Page 7: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Challenges 2/2

...

3. Tune the synchronization frequency among computing nodes.

- Current implementations synchronize vector copies either:
  - continuously during the epoch (waste of resources);
  - or once after every epoch (slow convergence).

- Epoch = a single iteration over the ratings.


Page 8: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Contributions

✓ We mitigate load imbalance by proposing an input slicing solution based on graph partitioning algorithms;

✓ we show how to reduce the amount of shared data by properly leveraging known characteristics of the input data set (its bipartite power-law nature);

✓ we show how to leverage the tradeoff between communication cost and algorithm convergence rate by tuning the frequency of the bulk synchronization phase during the computation.


Page 9: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Graph representation

- The rating matrix describes a bipartite graph.

[Figure: non-zero entries of R as edges between user and item vertices.]

- Real data: skewed power-law degree distribution.


Page 10: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Input partitioner

- Vertex-cut performs better than edge-cut on power-law graphs.

- Assumption: the input data doesn't fit in main memory.

- Streaming algorithm.

- Balanced k-way vertex-cut graph partitioning:
  - minimize replicas;
  - balance the edge load.


Page 11: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Balanced Vertex-Cut Streaming Algorithms

- Hashing: pseudo-random edge assignment;
- Grid: shuffle and split the rating matrix into identical blocks;
- Greedy: [Gonzalez et al. 2012] and [Ahmed et al. 2013].

Bipartite-Aware Greedy Algorithm

- Real-world bipartite graphs are often significantly skewed: one of the two sets is much bigger than the other.
- By perfectly splitting the bigger set it is possible to achieve a smaller replication factor.
- GIP (Greedy - Item Partitioned);
- GUP (Greedy - User Partitioned).

(A sketch of the greedy assignment rule follows.)
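The sketch below shows the plain greedy vertex-cut rule in the spirit of [Gonzalez et al. 2012]: it streams (user, item) edges and assigns each to one of k partitions, tracking replica sets and edge load. The bipartite-aware GIP/GUP variants additionally pin every item (respectively user) to exactly one partition; data structures and tie-breaking here are illustrative, not the paper's code.

from collections import defaultdict

def greedy_vertex_cut(edges, k):
    # edges : stream of (user, item) pairs; k : number of partitions
    replicas = defaultdict(set)      # vertex -> partitions holding a replica of it
    load = [0] * k                   # number of edges assigned to each partition
    assignment = []
    for u, i in edges:
        ru, ri = replicas[('u', u)], replicas[('i', i)]
        # prefer partitions holding both endpoints, then one, then any partition
        candidates = (ru & ri) or (ru | ri) or set(range(k))
        p = min(candidates, key=lambda c: load[c])   # least-loaded candidate
        ru.add(p)
        ri.add(p)
        load[p] += 1
        assignment.append(p)
    return assignment, replicas, load

From the replica sets one can compute the replication factor used in the evaluation (average number of partitions per vertex), and from the load vector the load standard deviation.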


Page 12: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Evaluation: The Data Sets

Degree distributions:

[Figure: item and user degree distributions (number of vertices vs. degree, log-log scale) for MovieLens 10M and Netflix.]


Page 13: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Experiments: Partitioning quality

[Figure: replication factor and load standard deviation (%) vs. number of partitions (8-256) for hashing, grid, greedy, GIP and GUP, on MovieLens and Netflix.]

- RF = Replication Factor
- RSD = Relative Standard Deviation


Page 14: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Synchronization frequency

[Figure: training loss (×10^7) vs. epoch on Netflix for grid, greedy, GUP and centralized SGD, with synchronization frequency f = 1 (left) and f = 100 (right).]

- f = synchronization frequency parameter: the number of synchronization steps performed during an epoch.

- It exposes a tradeoff between communication cost and convergence rate.


Page 15: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Evaluation: SSE and Communication cost

[Figure: SSE vs. synchronization frequency (grid, greedy, GUP; |C| = 8 and |C| = 32) and communication cost vs. synchronization frequency (hashing, grid, greedy, GUP), on MovieLens and Netflix, with zoomed insets.]

- SSE: sum of squared errors between the loss curves of the ASGD variants and the centralized SGD curve.

- CC: communication cost.


Page 16: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Communication cost

- T = the training set
- U = the set of users
- I = the set of items
- V = U ∪ I
- C = the set of processing nodes
- RF = replication factor
- RF_U = users' RF
- RF_I = items' RF

$\mathrm{RF} = \frac{|U|\,\mathrm{RF}_U + |I|\,\mathrm{RF}_I}{|V|}$

$f = 1 \;\Rightarrow\; \mathrm{CC} \approx 2\,(|U|\,\mathrm{RF}_U + |I|\,\mathrm{RF}_I) = 2\,|V|\,\mathrm{RF}$

$f = \frac{|T|}{|C|} \;\Rightarrow\; \mathrm{CC} \approx |T|\,(\mathrm{RF}_U + \mathrm{RF}_I)$
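As a rough sanity check of the two boundary cases, the snippet below plugs the public Netflix Prize dataset sizes into the formulas above; the number of nodes and the replication factors are illustrative placeholders, not values measured in the paper.

# Netflix Prize dataset sizes (public figures); RF values are hypothetical
U, I, T = 480_189, 17_770, 100_480_507   # users, items, training ratings
C = 32                                    # illustrative number of computing nodes
rf_u, rf_i = 1.0, 8.0                     # illustrative replication factors

V = U + I
rf = (U * rf_u + I * rf_i) / V            # overall replication factor

cc_f1  = 2 * (U * rf_u + I * rf_i)        # f = 1: synchronize once per epoch
cc_max = T * (rf_u + rf_i)                # f = |T|/|C|: synchronize after every local update

print(f"RF = {rf:.2f}, CC(f=1) ~ {cc_f1:.2e}, CC(f=|T|/|C|) ~ {cc_max:.2e}")

With these illustrative numbers, per-update synchronization costs roughly three orders of magnitude more than once-per-epoch synchronization, which is exactly the tradeoff the f parameter exposes.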


Page 17: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Conclusions

Three distinct contributions aimed at improving the efficiency and scalability of ASGD:

1. we proposed an input slicing solution, based on a graph partitioning approach, that mitigates the load imbalance among SGD instances (i.e. better scalability);

2. we further reduced the amount of shared data by exploiting specific characteristics of the training data set, which provides lower communication costs during the algorithm's execution (i.e. better efficiency);

3. we introduced a synchronization frequency parameter driving a tradeoff that can be accurately leveraged to further improve the algorithm's efficiency.


Page 18: GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Thank you!

Questions?

Fabio Petroni

Rome, Italy

Current position:

PhD Student in Computer Engineering, Sapienza University of Rome

Research Areas:

Recommendation Systems, Collaborative Filtering, Distributed Systems

[email protected]
