Large Scale Matrix Factorization: Systems and Acceleration (bigdataieee.org › BigData2016 › files › Tutorial3-2.pdf)
TRANSCRIPT
Wei Tan
IBM T. J. Watson Research Center
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan
Large Scale Matrix Factorization: Systems and Acceleration
Agenda
Fei’s talk covers the formalism/theory/math of MF
My talk focuses on “how to run it fast, scalable and cost-efficient”
– Matrix factorization, SGD and ALS (10 min)
– Parallelize and accelerate SGD and ALS (20 min)
– GPU accelerated SGD and ALS (20 min)
– Conclusion and QA (10 min)
Matrix Factorization
[Figure: the ratings matrix R (m users × n items, sparse entries marked *) is approximated as XᵀΘ; column x_u of X (f × m) is user u's feature vector and column θ_v of Θ (f × n) is item v's feature vector]
MF Explained using Recommender Systems
How: factorize the rating matrix R into X and Θ (R ≈ XᵀΘ) and minimize the empirical loss:

    min_{X,Θ} Σ_{(u,v)∈Ω} (r_{uv} − x_uᵀθ_v)² + λ(‖X‖²_F + ‖Θ‖²_F)

(Ω is the set of observed ratings; the λ term is the usual Frobenius-norm regularizer.)
Input: users' ratings on some items
Output: user/item features
Use: predict missing ratings; use features for other tasks (e.g., clustering)
The same factorization pattern appears elsewhere: a topic model factorizes a word × document matrix, and word embedding factorizes a word × word co-occurrence matrix, both into X and Θᵀ.
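In code, the factor model above is just an inner product per (user, item) pair. A minimal NumPy sketch, where the sizes m, n, f are illustrative and not from the slides:

```python
import numpy as np

# Hypothetical toy sizes: m users, n items, f latent features.
m, n, f = 4, 5, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(f, m))       # user features, one column x_u per user
Theta = rng.normal(size=(f, n))   # item features, one column theta_v per item

# The full predicted rating matrix is X^T Theta ...
R_hat = X.T @ Theta

# ... and a single (possibly missing) rating is predicted by an inner product.
u, v = 1, 3
assert np.isclose(R_hat[u, v], X[:, u] @ Theta[:, v])
```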
Matrix Factorization is a Key Kernel
In machine learning and HPC applications, matrix factorization underlies:
– Recommender systems: predict missing ratings
– Complex networks: link prediction, vertex clustering (group similar users/items)
– Web search: latent semantic models (match query and document)
– Natural language processing: word embedding as input to DNNs
– Deep learning: tensor decomposition, model compression, embedding layers
(The slide's legend marks which of these are supported in cuMF and which are to be supported.)
Challenge: MF needs to be fast, scalable, economic
– Fast: recommend and update models timely
– Scalable: Facebook-scale data (100 B ratings, 1 B users)
– Economic: avoid large infrastructure
To Solve MF: SGD

Stochastic gradient descent (SGD):
– Update takes one rating at a time
– Vector inner product: memory bound
– Needs many light epochs
– Parallelization: non-trivial
– Handles dense (implicit) ratings: no

[Figure: each SGD step picks one observed rating and updates only the corresponding x_u and θ_v]
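The per-rating update behind the bullets above can be sketched in NumPy. The learning rate, regularization weight, and starting vectors below are illustrative assumptions, not cuMF's actual kernel:

```python
import numpy as np

def sgd_step(x_u, theta_v, r_uv, lr=0.05, lam=0.05):
    """One SGD update for a single observed rating r_uv: only the two
    f-dimensional vectors involved are touched (hence the light,
    memory-bound epochs)."""
    err = r_uv - x_u @ theta_v          # prediction error on this rating
    x_old = x_u.copy()                  # theta_v's update uses the old x_u
    x_u += lr * (err * theta_v - lam * x_u)
    theta_v += lr * (err * x_old - lam * theta_v)
    return err

# Repeatedly applying the step to one rating drives the error down.
x, th = np.full(3, 0.5), np.full(3, 0.5)
errs = [abs(sgd_step(x, th, r_uv=5.0)) for _ in range(200)]
assert errs[0] > 4 and errs[-1] < 0.5
```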
To Solve MF: ALS

Alternating Least Squares (ALS):
– Update takes ALL ratings at a time
– Vector outer product and solve: compute bound
– Needs few heavy epochs
– Parallelization: straightforward
– Handles dense (implicit) ratings: yes

[Figure: solving one x_u aggregates the θ_v of every item user u rated]
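The "outer product & solve" step for one user can be sketched as the standard regularized normal-equations solve (illustrative NumPy, not cuMF's CUDA implementation):

```python
import numpy as np

def als_solve_user(Theta_u, r_u, lam):
    """Solve one user's normal equations:
    x_u = (Theta_u Theta_u^T + lam*I)^{-1} Theta_u r_u,
    where Theta_u (f x |Omega_u|) holds the features of the items
    the user rated and r_u their ratings."""
    f = Theta_u.shape[0]
    A = Theta_u @ Theta_u.T + lam * np.eye(f)   # f x f Gram matrix
    b = Theta_u @ r_u
    return np.linalg.solve(A, b)

# With lam -> 0 and noiseless ratings, the solve recovers the true x_u.
rng = np.random.default_rng(2)
f, k = 4, 50
Theta_u = rng.normal(size=(f, k))
x_true = rng.normal(size=f)
r_u = Theta_u.T @ x_true
x_hat = als_solve_user(Theta_u, r_u, lam=1e-8)
assert np.allclose(x_hat, x_true, atol=1e-4)
```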
To Solve MF: CD

Coordinate descent (CD):
– Similar to ALS
– But updates one coordinate of x_u (and of θ_v) at a time

[Figure: same access pattern as ALS, restricted to one coordinate per update]
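A sketch of the one-coordinate closed-form update (the standard CCD-style ridge update; the names and constants here are illustrative, not from the slides):

```python
import numpy as np

def cd_update_coord(x_u, Thetas, ratings, k, lam):
    """Closed-form update of coordinate x_u[k], holding all other
    coordinates (and all item features) fixed.
    Thetas: (f x num_rated) item features; ratings: the user's ratings."""
    err = ratings - x_u @ Thetas                       # current residuals
    num = ((err + x_u[k] * Thetas[k]) * Thetas[k]).sum()
    den = lam + (Thetas[k] ** 2).sum()
    x_u[k] = num / den

# Sweeping the coordinates repeatedly drives the squared error down.
rng = np.random.default_rng(3)
f, k_items = 3, 20
Thetas = rng.normal(size=(f, k_items))
ratings = rng.normal(size=k_items)
x = np.zeros(f)
def sq_err():
    return ((ratings - x @ Thetas) ** 2).sum()
before = sq_err()
for _ in range(5):
    for k in range(f):
        cd_update_coord(x, Thetas, ratings, k, lam=0.1)
assert sq_err() < before
```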
To Parallelize ALS

Alternating Least Squares (ALS):
– Solve the x_u's independently (the θ_v's thereafter)
– Parallelize the solves of the x_u's on multiple nodes
– Options for Θ: replicate it, partially replicate it, or split it on multiple nodes

[Figure: each x_u solve reads a subset of Θ]
Parallelize SGD: Hogwild!

Hogwild! [Niu et al. 2011]: parallel SGD converges despite (occasional) update conflicts.

[Figure: workers 1-3 each update one (x_u, θ_v) pair; two workers touching the same θ_v produces an update conflict]
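A toy Hogwild!-style run. Python's GIL means the threads below interleave rather than truly race, but the lock-free update pattern is the same; all sizes and hyperparameters are made up:

```python
import threading
import numpy as np

# Hogwild!-style sketch: workers update shared factor matrices with
# no locks at all; occasional conflicting updates are tolerated.
rng = np.random.default_rng(4)
m, n, f = 8, 8, 2
X = rng.normal(scale=0.1, size=(m, f))
Theta = rng.normal(scale=0.1, size=(n, f))
R = rng.normal(size=(m, n))                 # pretend every entry is observed
ratings = [(u, v, R[u, v]) for u in range(m) for v in range(n)]

def worker(samples, lr=0.02, lam=0.01, epochs=50):
    for _ in range(epochs):
        for u, v, r in samples:
            err = r - X[u] @ Theta[v]       # racy reads/writes, by design
            X[u] += lr * (err * Theta[v] - lam * X[u])
            Theta[v] += lr * (err * X[u] - lam * Theta[v])

def loss():
    return ((R - X @ Theta.T) ** 2).sum()

before = loss()
# Split the ratings across 4 lock-free workers.
chunks = [ratings[i::4] for i in range(4)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert loss() < before
```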
Hogwild! Not Good Enough?

Random sampling hurts cache performance (on GPUs): the hardware cannot prefetch.

[Figure: random rating samples scatter accesses across X and Θ]
Parallelize SGD: Matrix blocking
[Figure: R divided into a grid of blocks, processed in waves; blocks within a wave share no rows or columns]
Divide R into blocks, say 4*4
4 workers update 4 “non-overlapping” blocks concurrently
– Workers do not need to communicate
Parallelize SGD: Matrix blocking
[Figure: more blocks (6×6) than workers (4); waves 1 and 2 shown]

Cons: all 4 workers need to complete before the next wave
Solution: more blocks than workers
– e.g., 6×6 blocks, 4 workers
– a worker can immediately pick up another block when its current one is done
Cons: scheduling overhead
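A tick-synchronized simplification of the block scheduling described above (the real scheduler is asynchronous; this sketch only demonstrates the non-overlap invariant):

```python
import itertools

def schedule(num_blocks=6, num_workers=4):
    """Greedy simulation of block scheduling: at each tick, idle workers
    grab any unprocessed block whose row and column no other worker
    currently holds. Returns the per-tick assignments."""
    todo = set(itertools.product(range(num_blocks), repeat=2))
    history = []
    while todo:
        busy_rows, busy_cols, tick = set(), set(), []
        for (i, j) in sorted(todo):
            if len(tick) == num_workers:
                break
            if i not in busy_rows and j not in busy_cols:
                tick.append((i, j))
                busy_rows.add(i)
                busy_cols.add(j)
        for b in tick:
            todo.remove(b)
        history.append(tick)
    return history

hist = schedule()
# Every block is processed exactly once ...
assert sum(len(t) for t in hist) == 36
# ... and no two concurrent blocks ever share a row or a column.
for tick in hist:
    rows = [i for i, _ in tick]
    cols = [j for _, j in tick]
    assert len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
```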
MF methods with SGD, ALS and CCD
Challenge: compute and memory capacity of CPU
A CPU offers ~1 Tflops of compute and ~80 GB/s of memory bandwidth. With f = 100, per epoch:
• ALS floating-point operations:
– Netflix: 1.5 T
– Hugewiki: 80 T
– Facebook: 2000 T
• SGD memory transfer:
– Netflix: 80 GB
– Hugewiki: 2.4 TB
– Facebook: 80 TB
These far exceed the flops and bandwidth capacity of a CPU.
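A back-of-envelope check of the Netflix line, assuming ≈100 M ratings and f = 100; the flop count per rating is a rough model, which is why it lands near, rather than exactly on, the quoted 1.5 T:

```python
# Rough model of the per-epoch costs quoted for Netflix.
nnz, f = 100e6, 100        # ~100 M ratings (assumption), f = 100

# ALS: accumulating theta_v theta_v^T per rating costs ~2 f^2 flops
# (multiply + add), which dominates the per-epoch work.
als_flops = 2 * nnz * f**2
assert 1e12 < als_flops < 3e12      # ~2 Tflops, same ballpark as 1.5 T

# SGD: each rating touches x_u and theta_v, i.e. 2 vectors of f fp32 values.
sgd_bytes = nnz * 2 * f * 4
assert sgd_bytes == 80e9            # 80 GB per epoch, matching the slide

# At 1 Tflops and 80 GB/s, even one epoch keeps a CPU busy for seconds.
print(als_flops / 1e12, "s of compute;", sgd_bytes / 80e9, "s of memory traffic")
```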
GPU vs. CPU: compute FLOPS and memory bandwidth
17 https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
• Raw performance: 1 GPU ≈ 10× a CPU
• Practical performance (CPU clusters pay for slow interconnects): 1 GPU > 10× CPU, and 4 GPUs >> 40× CPU
Goal: a CUDA library for MF
cuMF: kernels for ALS and SGD, built on CUDA, underpinning GPU MF applications such as collaborative filtering and word embedding.
– Fast: fast training; update the model quickly
– Scalable: deal with big data; exploit fast interconnects
– Cost efficient: fully utilize flops and bandwidth; cheaper than CPU solutions
Challenges of ALS
• ALS needs to solve many least-squares systems of the form (θ_v is analogous):

    x_u = (Σ_{v∈Ω_u} θ_v θ_vᵀ + λI)⁻¹ Σ_{v∈Ω_u} r_{uv} θ_v

• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive
• Challenge 2: LU or Cholesky solver: compute intensive
• Challenge 3: a single GPU can NOT handle big m, n and N_z
Challenge 1: improve flops
Nvidia Pascal: memory BW 740 GB/s, compute 11 Tflops.
Higher flops require higher operational intensity (more flops per byte), i.e., caching!

[Figure: roofline plot, attainable flops vs. operational intensity (flops/byte). The 740 GB/s bandwidth roof meets the 11 Tflops compute roof at ≈15 flops/byte; below that point a kernel is under-utilized, above it the GPU is fully utilized]
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009)
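The roofline model behind this slide fits in a few lines (peak numbers taken from the slide):

```python
def attainable_flops(intensity, peak_flops=11e12, peak_bw=740e9):
    """Roofline model: a kernel's attainable flop rate is capped by
    either the compute roof or bandwidth * operational intensity."""
    return min(peak_flops, peak_bw * intensity)

# Ridge point: the intensity where the two roofs meet (~15 flops/byte on Pascal).
ridge = 11e12 / 740e9
assert 14 < ridge < 16

# Below the ridge the kernel is bandwidth bound (under-utilized) ...
assert attainable_flops(2) == 740e9 * 2
# ... above it, compute bound (fully utilized).
assert attainable_flops(50) == 11e12
```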
Address challenge 1: memory-optimized ALS
• To obtain Σ_{v∈Ω_u} θ_v θ_vᵀ for each user u:
1. non-coalesced read of the needed θ_v's from global memory
2. stage them into shared memory (smem)
3. tile and aggregate the outer products in registers
Address Challenge 2: exact solver is compute intensive
Exact ALS: every iteration exactly solves X, then Θ: O(f³) per solve.
Approximate ALS: every iteration only approximately solves X and Θ (e.g., working with f_s << f): O(f²) per solve.
Address Challenge 2: use CG solver
• Solver time: CG ≈ ¼ of LU
• The CG solver is memory- (instead of compute-) bound
• CG with FP16 ≈ ½ the time of CG with FP32
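A textbook conjugate-gradient solver applied to an ALS-style SPD system; the FP16 and batching details of cuMF's actual solver are omitted, and the test matrix below is an illustrative stand-in:

```python
import numpy as np

def cg_solve(A, b, iters=30, tol=1e-16):
    """Plain conjugate gradient for a symmetric positive-definite A.
    The dominant per-iteration cost is one matrix-vector product,
    which is why the solver is memory- rather than compute-bound."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# ALS's per-user system (Gram matrix + lam*I) is SPD, so CG applies directly.
rng = np.random.default_rng(5)
f = 10
G = rng.normal(size=(f, f))
A = G @ G.T / f + np.eye(f)       # well-conditioned SPD stand-in
b = rng.normal(size=f)
x = cg_solve(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```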
Address Challenge 3: scale-up ALS on multiple GPUs
Model parallel: each GPU solves a portion of the model.
Address Challenge 3: scale-up ALS on multiple GPUs
Data parallel: each GPU solves using a portion of the training data, combined with model parallelism.
Recap: challenges of ALS
• ALS needs to solve many least-squares systems
• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive
  -- addressed with registers, smem and non-coalesced reads
• Challenge 2: LU or Cholesky solver: compute intensive
  -- addressed with an approximate CG solver and FP16
• Challenge 3: a single GPU can NOT handle big m, n and N_z
  -- addressed with model and data parallelism, and topology-aware reduction
Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
Modified mllib/ALS detects GPUs and offloads matrix computation
Leverages the best of Spark (scale-out) and GPUs (scale-up)

[Stack: ALS apps → mllib/ALS → cuMF via JNI]
https://github.com/IBMSparkGPU/CUDA-MLlib http://www-01.ibm.com/support/docview.wss?uid=swg21983421
Connect cuMF to Spark MLlib
[Figure: multiple Power 8 nodes, each with 2 K40 GPUs; rating data lives in RDDs on the CPU, CUDA kernels run on each GPU, and parameters are shuffled between nodes]
RDD on CPU: to distribute rating data and shuffle parameters
Solver on GPU: to form and solve the least-squares systems
Able to run on multiple nodes, and multiple GPUs per node
Challenges of SGD
• Iterate over all ratings and, for each, do this update in sequence:

    e_{uv} = r_{uv} − x_uᵀθ_v
    x_u ← x_u + α(e_{uv}θ_v − λx_u)
    θ_v ← θ_v + α(e_{uv}x_u − λθ_v)

• Memory bound

Two design questions:
1. the update kernel: exploit cache/memory coalescing, half precision, warp shuffle, ILP and register reuse
2. how to parallelize: (a) Hogwild! or (b) matrix blocking
Experiment 1: is cuMF fast and scalable?
• cuMF_ALS with FP16, on Maxwell and Pascal; 1 GPU for Netflix and Yahoo
• Baselines: LIBMF (1 CPU with 40 threads); NOMAD (32 nodes for Netflix and Yahoo, 64 HPC nodes for Hugewiki)
• cuMF is 2-10× as fast
Experiment 2: cuMF_ALS and cuMF_SGD (on Maxwell)
• ALS is slightly slower than SGD on a single GPU
• On the big dataset Hugewiki, ALS@4 GPUs performs best: SGD is harder to parallelize across multiple GPUs!
Experiment 3: is cuMF cost efficient?
• cuMF_ALS @ 4 Maxwell GPUs runs in ≈ 1/10 the time of SparkALS @ 50 nodes
• at ≈ $2.5/hr, it also costs ≈ 1/10 as much per hour as 50 nodes
• net: ≈ 1% of SparkALS's cost
Conclusion
Why accelerate matrix factorization using GPUs?
– MF needs to be fast, scalable, and economic
– GPUs offer ~10× the flops and memory BW, and fast interconnects
How does cuMF tackle the challenges?
– Optimize memory access, parallelism and communication
– Approximate computing
– Reduced precision
What is the result?
– Implemented ALS and SGD
– Up to 10× as fast, 100× as cost-efficient
– Use cuMF standalone, with Spark or TensorFlow
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016
CuMF_SGD: Fast and Scalable Matrix Factorization. CoRR abs/1610.05838, 2016
Code: http://github.com/cuMF/
Blog: http://ibm.biz/cumf-blog
Contact: Wei Tan, [email protected]