Large Scale Matrix Factorization: Systems and Acceleration (bigdataieee.org › BigData2016 › files › Tutorial3-2.pdf)
TRANSCRIPT
Wei Tan
IBM T. J. Watson Research Center
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan
Large Scale Matrix Factorization: Systems and Acceleration
Agenda
Fei’s talk covers the formalism/theory/math of MF
My talk focuses on “how to run it fast, scalable and cost-efficient”
– Matrix factorization, SGD and ALS (10 min)
– Parallelize and accelerate SGD and ALS (20 min)
– GPU accelerated SGD and ALS (20 min)
– Conclusion and QA (10 min)
Matrix Factorization
[Figure: the ratings matrix R (m users × n items, sparse entries marked *) is approximated as XᵀΘ; column x_u of X (f × m) is user u's feature vector and column θ_v of Θ (f × n) is item v's feature vector]
MF Explained using Recommender Systems
How: factorize the rating matrix R into X and Θ (R ≈ XᵀΘ) and minimize the empirical loss:

    min_{X,Θ} Σ_{(u,v)∈Ω} (r_{uv} − x_uᵀθ_v)² + λ(‖X‖²_F + ‖Θ‖²_F)

(Ω is the set of observed ratings; the λ term is the usual Frobenius-norm regularizer.)
Input: users' ratings on some items
Output: user/item features
Use: predict missing ratings; use features for other tasks (e.g., clustering)
The same factorization pattern appears elsewhere: a topic model factorizes a word × document matrix, and word embedding factorizes a word × word co-occurrence matrix, both into X and Θᵀ.
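In code, the factor model above is just an inner product per (user, item) pair. A minimal NumPy sketch, where the sizes m, n, f are illustrative and not from the slides:

```python
import numpy as np

# Hypothetical toy sizes: m users, n items, f latent features.
m, n, f = 4, 5, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(f, m))       # user features, one column x_u per user
Theta = rng.normal(size=(f, n))   # item features, one column theta_v per item

# The full predicted rating matrix is X^T Theta ...
R_hat = X.T @ Theta

# ... and a single (possibly missing) rating is predicted by an inner product.
u, v = 1, 3
assert np.isclose(R_hat[u, v], X[:, u] @ Theta[:, v])
```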
Matrix Factorization is a Key Kernel
In machine learning and HPC applications, matrix factorization underlies:
– Recommender systems: predict missing ratings
– Complex networks: link prediction, vertex clustering (group similar users/items)
– Web search: latent semantic models (match query and document)
– Natural language processing: word embedding as input to DNNs
– Deep learning: tensor decomposition, model compression, embedding layers
(The slide's legend marks which of these are supported in cuMF and which are to be supported.)
Challenge: MF needs to be fast, scalable, economic
– Fast: recommend and update models timely
– Scalable: Facebook-scale data (100 B ratings, 1 B users)
– Economic: avoid large infrastructure
To Solve MF: SGD

Stochastic gradient descent (SGD):
– Update takes one rating at a time
– Vector inner product: memory bound
– Needs many light epochs
– Parallelization: non-trivial
– Handles dense (implicit) ratings: no

[Figure: each SGD step picks one observed rating and updates only the corresponding x_u and θ_v]
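The per-rating update behind the bullets above can be sketched in NumPy. The learning rate, regularization weight, and starting vectors below are illustrative assumptions, not cuMF's actual kernel:

```python
import numpy as np

def sgd_step(x_u, theta_v, r_uv, lr=0.05, lam=0.05):
    """One SGD update for a single observed rating r_uv: only the two
    f-dimensional vectors involved are touched (hence the light,
    memory-bound epochs)."""
    err = r_uv - x_u @ theta_v          # prediction error on this rating
    x_old = x_u.copy()                  # theta_v's update uses the old x_u
    x_u += lr * (err * theta_v - lam * x_u)
    theta_v += lr * (err * x_old - lam * theta_v)
    return err

# Repeatedly applying the step to one rating drives the error down.
x, th = np.full(3, 0.5), np.full(3, 0.5)
errs = [abs(sgd_step(x, th, r_uv=5.0)) for _ in range(200)]
assert errs[0] > 4 and errs[-1] < 0.5
```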
To Solve MF: ALS

Alternating Least Squares (ALS):
– Update takes ALL ratings at a time
– Vector outer product and solve: compute bound
– Needs few heavy epochs
– Parallelization: straightforward
– Handles dense (implicit) ratings: yes

[Figure: solving one x_u aggregates the θ_v of every item user u rated]
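The "outer product & solve" step for one user can be sketched as the standard regularized normal-equations solve (illustrative NumPy, not cuMF's CUDA implementation):

```python
import numpy as np

def als_solve_user(Theta_u, r_u, lam):
    """Solve one user's normal equations:
    x_u = (Theta_u Theta_u^T + lam*I)^{-1} Theta_u r_u,
    where Theta_u (f x |Omega_u|) holds the features of the items
    the user rated and r_u their ratings."""
    f = Theta_u.shape[0]
    A = Theta_u @ Theta_u.T + lam * np.eye(f)   # f x f Gram matrix
    b = Theta_u @ r_u
    return np.linalg.solve(A, b)

# With lam -> 0 and noiseless ratings, the solve recovers the true x_u.
rng = np.random.default_rng(2)
f, k = 4, 50
Theta_u = rng.normal(size=(f, k))
x_true = rng.normal(size=f)
r_u = Theta_u.T @ x_true
x_hat = als_solve_user(Theta_u, r_u, lam=1e-8)
assert np.allclose(x_hat, x_true, atol=1e-4)
```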
To Solve MF: CD

Coordinate descent (CD):
– Similar to ALS
– But updates one coordinate of x_u (and of θ_v) at a time

[Figure: same access pattern as ALS, restricted to one coordinate per update]
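A sketch of the one-coordinate closed-form update (the standard CCD-style ridge update; the names and constants here are illustrative, not from the slides):

```python
import numpy as np

def cd_update_coord(x_u, Thetas, ratings, k, lam):
    """Closed-form update of coordinate x_u[k], holding all other
    coordinates (and all item features) fixed.
    Thetas: (f x num_rated) item features; ratings: the user's ratings."""
    err = ratings - x_u @ Thetas                       # current residuals
    num = ((err + x_u[k] * Thetas[k]) * Thetas[k]).sum()
    den = lam + (Thetas[k] ** 2).sum()
    x_u[k] = num / den

# Sweeping the coordinates repeatedly drives the squared error down.
rng = np.random.default_rng(3)
f, k_items = 3, 20
Thetas = rng.normal(size=(f, k_items))
ratings = rng.normal(size=k_items)
x = np.zeros(f)
def sq_err():
    return ((ratings - x @ Thetas) ** 2).sum()
before = sq_err()
for _ in range(5):
    for k in range(f):
        cd_update_coord(x, Thetas, ratings, k, lam=0.1)
assert sq_err() < before
```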
To Parallelize ALS

Alternating Least Squares (ALS):
– Solve the x_u's independently (the θ_v's thereafter)
– Parallelize the solves of the x_u's on multiple nodes
– Options for Θ: replicate it, partially replicate it, or split it on multiple nodes

[Figure: each x_u solve reads a subset of Θ]
Parallelize SGD: Hogwild!

Hogwild! [Niu et al. 2011]: parallel SGD converges despite (occasional) update conflicts.

[Figure: workers 1-3 each update one (x_u, θ_v) pair; two workers touching the same θ_v produces an update conflict]
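A toy Hogwild!-style run. Python's GIL means the threads below interleave rather than truly race, but the lock-free update pattern is the same; all sizes and hyperparameters are made up:

```python
import threading
import numpy as np

# Hogwild!-style sketch: workers update shared factor matrices with
# no locks at all; occasional conflicting updates are tolerated.
rng = np.random.default_rng(4)
m, n, f = 8, 8, 2
X = rng.normal(scale=0.1, size=(m, f))
Theta = rng.normal(scale=0.1, size=(n, f))
R = rng.normal(size=(m, n))                 # pretend every entry is observed
ratings = [(u, v, R[u, v]) for u in range(m) for v in range(n)]

def worker(samples, lr=0.02, lam=0.01, epochs=50):
    for _ in range(epochs):
        for u, v, r in samples:
            err = r - X[u] @ Theta[v]       # racy reads/writes, by design
            X[u] += lr * (err * Theta[v] - lam * X[u])
            Theta[v] += lr * (err * X[u] - lam * Theta[v])

def loss():
    return ((R - X @ Theta.T) ** 2).sum()

before = loss()
# Split the ratings across 4 lock-free workers.
chunks = [ratings[i::4] for i in range(4)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert loss() < before
```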
Hogwild! Not Good Enough?

Random sampling hurts cache performance (on GPUs): the hardware cannot prefetch.

[Figure: random rating samples scatter accesses across X and Θ]
Parallelize SGD: Matrix blocking
[Figure: R divided into a grid of blocks, processed in waves; blocks within a wave share no rows or columns]
Divide R into blocks, say 4*4
4 workers update 4 “non-overlapping” blocks concurrently
– Workers do not need to communicate
Parallelize SGD: Matrix blocking
[Figure: more blocks (6×6) than workers (4); waves 1 and 2 shown]

Cons: all 4 workers need to complete before the next wave
Solution: more blocks than workers
– e.g., 6×6 blocks, 4 workers
– a worker can immediately pick up another block when its current one is done
Cons: scheduling overhead
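A tick-synchronized simplification of the block scheduling described above (the real scheduler is asynchronous; this sketch only demonstrates the non-overlap invariant):

```python
import itertools

def schedule(num_blocks=6, num_workers=4):
    """Greedy simulation of block scheduling: at each tick, idle workers
    grab any unprocessed block whose row and column no other worker
    currently holds. Returns the per-tick assignments."""
    todo = set(itertools.product(range(num_blocks), repeat=2))
    history = []
    while todo:
        busy_rows, busy_cols, tick = set(), set(), []
        for (i, j) in sorted(todo):
            if len(tick) == num_workers:
                break
            if i not in busy_rows and j not in busy_cols:
                tick.append((i, j))
                busy_rows.add(i)
                busy_cols.add(j)
        for b in tick:
            todo.remove(b)
        history.append(tick)
    return history

hist = schedule()
# Every block is processed exactly once ...
assert sum(len(t) for t in hist) == 36
# ... and no two concurrent blocks ever share a row or a column.
for tick in hist:
    rows = [i for i, _ in tick]
    cols = [j for _, j in tick]
    assert len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
```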
MF methods with SGD, ALS and CCD
Challenge: compute and memory capacity of CPU
A CPU offers ~1 Tflops of compute and ~80 GB/s of memory bandwidth. With f = 100, per epoch:
• ALS floating-point operations:
– Netflix: 1.5 T
– Hugewiki: 80 T
– Facebook: 2000 T
• SGD memory transfer:
– Netflix: 80 GB
– Hugewiki: 2.4 TB
– Facebook: 80 TB
These far exceed the flops and bandwidth capacity of a CPU.
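A back-of-envelope check of the Netflix line, assuming ≈100 M ratings and f = 100; the flop count per rating is a rough model, which is why it lands near, rather than exactly on, the quoted 1.5 T:

```python
# Rough model of the per-epoch costs quoted for Netflix.
nnz, f = 100e6, 100        # ~100 M ratings (assumption), f = 100

# ALS: accumulating theta_v theta_v^T per rating costs ~2 f^2 flops
# (multiply + add), which dominates the per-epoch work.
als_flops = 2 * nnz * f**2
assert 1e12 < als_flops < 3e12      # ~2 Tflops, same ballpark as 1.5 T

# SGD: each rating touches x_u and theta_v, i.e. 2 vectors of f fp32 values.
sgd_bytes = nnz * 2 * f * 4
assert sgd_bytes == 80e9            # 80 GB per epoch, matching the slide

# At 1 Tflops and 80 GB/s, even one epoch keeps a CPU busy for seconds.
print(als_flops / 1e12, "s of compute;", sgd_bytes / 80e9, "s of memory traffic")
```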
GPU vs. CPU: compute FLOPS and memory bandwidth
17 https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
• Raw performance: 1 GPU ≈ 10× a CPU
• Practical performance (CPU clusters pay for slow interconnects): 1 GPU > 10× CPU, and 4 GPUs >> 40× CPU
Goal: a CUDA library for MF
cuMF: kernels for ALS and SGD, built on CUDA, underpinning GPU MF applications such as collaborative filtering and word embedding.
– Fast: fast training; update the model quickly
– Scalable: deal with big data; exploit fast interconnects
– Cost efficient: fully utilize flops and bandwidth; cheaper than CPU solutions
Challenges of ALS
• ALS needs to solve many least-squares systems of the form (θ_v is analogous):

    x_u = (Σ_{v∈Ω_u} θ_v θ_vᵀ + λI)⁻¹ Σ_{v∈Ω_u} r_{uv} θ_v

• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive
• Challenge 2: LU or Cholesky solver: compute intensive
• Challenge 3: a single GPU can NOT handle big m, n and N_z
Challenge 1: improve flops
Nvidia Pascal: memory BW 740 GB/s, compute 11 Tflops.
Higher flops require higher operational intensity (more flops per byte), i.e., caching!

[Figure: roofline plot, attainable flops vs. operational intensity (flops/byte). The 740 GB/s bandwidth roof meets the 11 Tflops compute roof at ≈15 flops/byte; below that point a kernel is under-utilized, above it the GPU is fully utilized]
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009)
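The roofline model behind this slide fits in a few lines (peak numbers taken from the slide):

```python
def attainable_flops(intensity, peak_flops=11e12, peak_bw=740e9):
    """Roofline model: a kernel's attainable flop rate is capped by
    either the compute roof or bandwidth * operational intensity."""
    return min(peak_flops, peak_bw * intensity)

# Ridge point: the intensity where the two roofs meet (~15 flops/byte on Pascal).
ridge = 11e12 / 740e9
assert 14 < ridge < 16

# Below the ridge the kernel is bandwidth bound (under-utilized) ...
assert attainable_flops(2) == 740e9 * 2
# ... above it, compute bound (fully utilized).
assert attainable_flops(50) == 11e12
```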
Address challenge 1: memory-optimized ALS
• To obtain Σ_{v∈Ω_u} θ_v θ_vᵀ for each user u:
1. non-coalesced read of the needed θ_v's from global memory
2. stage them into shared memory (smem)
3. tile and aggregate the outer products in registers
Address Challenge 2: exact solver is compute intensive
Exact ALS: every iteration exactly solves X, then Θ: O(f³) per solve.
Approximate ALS: every iteration only approximately solves X and Θ (e.g., working with f_s << f): O(f²) per solve.
Address Challenge 2: use CG solver
• Solver time: CG ≈ ¼ of LU
• The CG solver is memory- (instead of compute-) bound
• CG with FP16 ≈ ½ the time of CG with FP32
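A textbook conjugate-gradient solver applied to an ALS-style SPD system; the FP16 and batching details of cuMF's actual solver are omitted, and the test matrix below is an illustrative stand-in:

```python
import numpy as np

def cg_solve(A, b, iters=30, tol=1e-16):
    """Plain conjugate gradient for a symmetric positive-definite A.
    The dominant per-iteration cost is one matrix-vector product,
    which is why the solver is memory- rather than compute-bound."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# ALS's per-user system (Gram matrix + lam*I) is SPD, so CG applies directly.
rng = np.random.default_rng(5)
f = 10
G = rng.normal(size=(f, f))
A = G @ G.T / f + np.eye(f)       # well-conditioned SPD stand-in
b = rng.normal(size=f)
x = cg_solve(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```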
Address Challenge 3: scale-up ALS on multiple GPUs
Model parallel: each GPU solves a portion of the model.
Address Challenge 3: scale-up ALS on multiple GPUs
Data parallel: each GPU solves using a portion of the training data, combined with model parallelism.
Recap: challenges of ALS
• ALS needs to solve many least-squares systems
• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive
  -- addressed with registers, smem and non-coalesced reads
• Challenge 2: LU or Cholesky solver: compute intensive
  -- addressed with an approximate CG solver and FP16
• Challenge 3: a single GPU can NOT handle big m, n and N_z
  -- addressed with model and data parallelism, and topology-aware reduction
Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
Modified mllib/ALS detects GPUs and offloads matrix computation
Leverages the best of Spark (scale-out) and GPUs (scale-up)

[Stack: ALS apps → mllib/ALS → cuMF via JNI]
https://github.com/IBMSparkGPU/CUDA-MLlib http://www-01.ibm.com/support/docview.wss?uid=swg21983421
Connect cuMF to Spark MLlib
[Figure: multiple Power 8 nodes, each with 2 K40 GPUs; rating data lives in RDDs on the CPU, CUDA kernels run on each GPU, and parameters are shuffled between nodes]
RDD on CPU: to distribute rating data and shuffle parameters
Solver on GPU: to form and solve the least-squares systems
Able to run on multiple nodes, and multiple GPUs per node
Challenges of SGD
• Iterate over all ratings and, for each, do this update in sequence:

    e_{uv} = r_{uv} − x_uᵀθ_v
    x_u ← x_u + α(e_{uv}θ_v − λx_u)
    θ_v ← θ_v + α(e_{uv}x_u − λθ_v)

• Memory bound

Two design questions:
1. the update kernel: exploit cache/memory coalescing, half precision, warp shuffle, ILP and register reuse
2. how to parallelize: (a) Hogwild! or (b) matrix blocking
Experiment 1: is cuMF fast and scalable?
• cuMF_ALS with FP16, on Maxwell and Pascal; 1 GPU for Netflix and Yahoo
• Baselines: LIBMF (1 CPU with 40 threads); NOMAD (32 nodes for Netflix and Yahoo, 64 HPC nodes for Hugewiki)
• cuMF is 2-10× as fast
Experiment 2: cuMF_ALS and cuMF_SGD (on Maxwell)
• ALS is slightly slower than SGD on a single GPU
• On the big dataset Hugewiki, ALS@4 GPUs performs best: SGD is harder to parallelize across multiple GPUs!
Experiment 3: is cuMF cost efficient?
• cuMF_ALS @ 4 Maxwell GPUs runs in ≈ 1/10 the time of SparkALS @ 50 nodes
• at ≈ $2.5/hr, it also costs ≈ 1/10 as much per hour as 50 nodes
• net: ≈ 1% of SparkALS's cost
Conclusion
Why accelerate matrix factorization using GPUs?
– MF needs to be fast, scalable, and economic
– GPUs offer ~10× the flops and memory BW, and fast interconnects
How does cuMF tackle the challenges?
– Optimize memory access, parallelism and communication
– Approximate computing
– Reduced precision
What is the result?
– Implemented ALS and SGD
– Up to 10× as fast, 100× as cost-efficient
– Use cuMF standalone, with Spark or TensorFlow
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016
CuMF_SGD: Fast and Scalable Matrix Factorization. CoRR abs/1610.05838, 2016
Code: http://github.com/cuMF/
Blog: http://ibm.biz/cumf-blog
Contact: Wei Tan, [email protected]