NeurIPS | 2018
TRANSCRIPT
Snap ML: A Hierarchical Framework for Machine Learning
C. Dünner*, T. Parnell*, D. Sarigiannis*, N. Ioannou*, A. Anghel*, G. Ravi+, M. Kandasamy+, H. Pozidis*
Snap ML is a new framework for efficient training of generalized linear models.
Snap ML implements novel out-of-core techniques to enable GPU acceleration at scale.
Snap ML is built on a novel hierarchical version of the popular CoCoA framework to enable multi-level distributed training.
Snap ML can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes.
Contributions
A unique feature of Snap ML is its design, which is aligned with the architecture of modern systems.
Local Solver
For large datasets, the GPU-CPU link can become the bottleneck.
Streaming Pipeline:
1. Using CUDA streams we can copy the next batch of data while the current is being trained
2. We use the CPU to generate random numbers for sampling in the GPU solver
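The overlap of data transfer and training can be sketched as a producer-consumer pipeline. The code below is a schematic illustration only, not Snap ML's actual CUDA implementation: a background thread stages the next batch (standing in for an asynchronous copy on a CUDA stream) while the consumer trains on the current one, so copy time is hidden behind compute time.

```python
import threading
from queue import Queue

def stage_batches(batches, staged: Queue):
    """Producer: simulates copying batches over the CPU-GPU link."""
    for batch in batches:
        staged.put(batch)   # in CUDA this would be an async memcpy on a stream
    staged.put(None)        # sentinel: no more batches

def train(staged: Queue):
    """Consumer: trains on each staged batch as soon as it is available."""
    processed = []
    while (batch := staged.get()) is not None:
        processed.append(sum(batch))  # stand-in for a solver step on the batch
    return processed

batches = [[1, 2], [3, 4], [5, 6]]
staged = Queue(maxsize=1)   # one batch in flight while another is being trained
copier = threading.Thread(target=stage_batches, args=(batches, staged))
copier.start()
result = train(staged)
copier.join()
print(result)               # [3, 7, 11]
```

The bounded queue (`maxsize=1`) is what makes this double buffering rather than unbounded prefetching: at most one batch is staged ahead of the one being trained.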
[1] Parallel Model Training Without Compromising Convergence. N. Ioannou, C. Dünner, K. Kourtis, T. Parnell. MLSys workshop (2018), oral, Fri. 7th December.
[2] Tera-Scale Coordinate Descent on GPUs. T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis. FGCS (2018).
[3] Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems. C. Dünner, T. Parnell, M. Jaggi. NIPS (2017).
[4] CoCoA: A General Framework for Communication-Efficient Distributed Optimization. V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018).
Framework      Models    GPU Acceleration   Distributed Training   Sparse Data Support
Scikit-learn   ML\{DL}   No                 No                     Yes
Spark MLlib    ML\{DL}   No                 Yes                    Yes
TensorFlow     ML        Yes                Yes                    Limited
Snap ML        GLMs      Yes                Yes                    Yes
Tera-Scale Benchmark
[Diagram: four workers, each with CPUs and GPUs, connected over a 10 Gbit/s network]
Hierarchical Optimization Framework
The Snap ML Framework
e.g. SVM, Lasso, Ridge Regression, Logistic Regression
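These models share a common template. A standard way to write the training objective, following the problem class in [3,4] (the poster's own formula is not recoverable from this transcript, so the notation below is an assumption), is:

```latex
\min_{\alpha \in \mathbb{R}^n} \; F(\alpha) \;=\; f(A\alpha) \;+\; \sum_{i=1}^{n} g_i(\alpha_i)
```

where A is the data matrix, f is a smooth loss term, and the g_i are convex (possibly non-smooth) terms such as regularizers; SVM, Lasso, ridge regression, and logistic regression are all instances of this form (some via their duals).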
Take advantage of non-uniform interconnects.
CPU solver: parallel primal/dual coordinate descent solver [1]
GPU solver: Twice Parallel Asynchronous Coordinate Descent [2]
Performance of Snap ML in comparison to other frameworks and previously published results for training a logistic regression classifier on the Terabyte Click-Logs dataset.
Data: 4.5 billion examples, 1 million features, ~3 TB
In Snap ML, the user can describe the application using high-level Python APIs for both single-node and multi-node applications: https://www.zurich.ibm.com/snapml/
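The high-level API follows the familiar scikit-learn-style fit/predict pattern. The class below is a toy stand-in (a plain batch-gradient logistic regression), NOT the real snapml package; the parameter names `use_gpu` and `device_ids` mirror the poster's annotation ("[0,1,2,3]: specify which GPUs to use") and are assumptions about the interface, and the GPU arguments are simply ignored here.

```python
import numpy as np

class LogisticRegression:
    """Toy sketch of a Snap-ML-like estimator interface (not the real library)."""

    def __init__(self, use_gpu=False, device_ids=None, max_iter=100, lr=0.1):
        self.use_gpu = use_gpu                 # ignored in this toy stand-in
        self.device_ids = device_ids or []     # ignored in this toy stand-in
        self.max_iter = max_iter
        self.lr = lr

    def fit(self, X, y):
        n, d = X.shape
        self.w_ = np.zeros(d)
        for _ in range(self.max_iter):         # plain batch gradient descent
            p = 1.0 / (1.0 + np.exp(-X @ self.w_))
            self.w_ -= self.lr * X.T @ (p - y) / n
        return self

    def predict(self, X):
        return (X @ self.w_ > 0).astype(int)

# Toy usage: linearly separable data in one dimension.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression(use_gpu=True, device_ids=[0, 1, 2, 3]).fit(X, y)
print(clf.predict(X))   # [0 0 1 1]
```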
single-node performance
task: logistic regression
dataset: criteo-kaggle (45 million examples)
infrastructure: Power AC922 server with V100 GPU
Level 1: distribution across nodes in a cluster
Level 2: distribution across heterogeneous compute units
Level 3: distribution across cores/threads
Consider Algorithm 1 applied to the GLM training objective, where the local subproblems are solved to relative accuracy θ in each iteration. Let f be β-smooth and convex and the g_i be general convex functions. Then, after t1 outer iterations with T2 inner iterations each, the suboptimality decreases sublinearly in t1.
Furthermore, if the g_i are μ-strongly convex, this rate improves to a linear (geometric) rate.
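For intuition, CoCoA-type analyses [3,4] yield bounds of the following shape. The exact constants from the poster are not recoverable from this transcript, so the expressions below are the standard forms and should be read as assumptions:

```latex
% general convex g_i: sublinear rate in the number of outer iterations t_1
F(\alpha^{(t_1)}) - F(\alpha^\star) \;\le\; \mathcal{O}\!\left(\frac{1}{(1-\theta)\, t_1}\right)

% mu-strongly convex g_i: linear (geometric) rate
F(\alpha^{(t_1)}) - F(\alpha^\star) \;\le\;
    \bigl(1 - (1-\theta)\, c(\mu,\beta)\bigr)^{t_1}
    \bigl(F(\alpha^{(0)}) - F(\alpha^\star)\bigr)
```

Here θ is the relative accuracy of the local solver (smaller θ means more inner work T2 per outer round) and c(μ, β) is a constant depending on the smoothness and strong-convexity parameters and on the data partitioning.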
[0,1,2,3]: specify which GPUs to use
Disjoint partitions
Data-local subtasks are defined by recursively applying a block-separable upper bound to the objective, similar to [4].
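For reference, the block-separable upper bound in the CoCoA framework [4] for partition k takes the following form; this is reproduced from [4] as a reference point, since the poster's exact notation is not recoverable from this transcript:

```latex
\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}) \;=\;
    \tfrac{1}{K}\, f(A\alpha)
    \;+\; \nabla f(A\alpha)^{\top} A_{[k]} \Delta\alpha_{[k]}
    \;+\; \tfrac{\sigma'\beta}{2} \bigl\| A_{[k]} \Delta\alpha_{[k]} \bigr\|^{2}
    \;+\; \sum_{i \in \mathcal{P}_k} g_i(\alpha_i + \Delta\alpha_i)
```

Each of the K partitions minimizes its own subproblem independently; applying the same construction again within a node (across GPUs) and within a GPU (across threads) yields the hierarchy's data-local subtasks.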
task: logistic regression
dataset: criteo-tera-byte (1 billion examples)
infrastructure: 4x Power AC922 servers with 4x V100 GPUs
*IBM Research – Zurich, Switzerland
+IBM Systems – Bangalore, India
[Figure: tera-scale benchmark. Test LogLoss vs. training time (minutes) for LIBLINEAR (1 core), Vowpal Wabbit (12 cores), Spark MLlib (512 cores), TensorFlow (60 worker machines, 29 parameter machines), TensorFlow on Spark (12 executors), TensorFlow (16 V100 GPUs), and Snap ML (16 V100 GPUs).]
[Figure: single-node performance. Test LogLoss vs. training time (seconds) for TensorFlow, Sklearn (LIBLINEAR), and Snap ML.]
[Figure: time to suboptimality (seconds) vs. number of inner iterations (T2), for a fast network (InfiniBand) and a slow network (1 Gbit Ethernet).]
node1,node2,node3,node4: specify which nodes to use
Black: previously published results; orange: run on our hardware (4x IBM Power9 with 4x NVIDIA V100 GPUs each).
Trade-off parameters θ and T2