
Snap ML: A Hierarchical Framework for Machine Learning
C. Dünner*, T. Parnell*, D. Sarigiannis*, N. Ioannou*, A. Anghel*, G. Ravi+, M. Kandasamy+, H. Pozidis*
NeurIPS | 2018

Snap ML is a new framework for efficient training of generalized linear models.

Snap ML implements novel out-of-core techniques to enable GPU acceleration at scale.

Snap ML is built on a novel hierarchical version of the popular CoCoA framework to enable multi-level distributed training.

Snap ML can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes.

Contributions
A unique feature of Snap ML is its design, aligned with the architecture of modern systems.

Local Solver

For large datasets the GPU-CPU link can become the bottleneck!
Streaming Pipeline:

1. Using CUDA streams, we copy the next batch of data to the GPU while the current batch is being used for training

2. We use the CPU to generate random numbers for sampling in the GPU solver
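A minimal sketch of this double-buffered pipeline, using CuPy streams as a stand-in for the raw CUDA runtime; the helpers load_batch and train_on_batch, the batch shape, and the number of batches are illustrative assumptions, not the Snap ML implementation.

```python
# Hedged sketch: copy batch i+1 on one stream while batch i is processed on
# another, with the coordinate sampling order generated on the CPU.
import numpy as np
import cupy as cp  # assumption: CuPy as a stand-in for raw CUDA streams

copy_stream = cp.cuda.Stream(non_blocking=True)     # host-to-device copies
compute_stream = cp.cuda.Stream(non_blocking=True)  # solver kernels

def load_batch(i):
    # illustrative placeholder: in practice this reads the next chunk of the
    # training data into (ideally pinned) host memory
    return np.random.rand(4096, 1024).astype(np.float32)

def train_on_batch(batch_gpu, order_gpu):
    # illustrative placeholder for the GPU coordinate-descent kernel
    batch_gpu * 1.0

n_batches = 8
with copy_stream:
    next_gpu = cp.asarray(load_batch(0))            # prefetch the first batch

for i in range(n_batches):
    copy_stream.synchronize()                        # wait until batch i has arrived
    current_gpu = next_gpu

    # (2) CPU generates the random coordinate order for batch i
    order_host = np.random.permutation(current_gpu.shape[0]).astype(np.int32)

    if i + 1 < n_batches:
        with copy_stream:                            # (1) copy batch i+1 while batch i trains
            next_gpu = cp.asarray(load_batch(i + 1))

    with compute_stream:
        order_gpu = cp.asarray(order_host)
        train_on_batch(current_gpu, order_gpu)

compute_stream.synchronize()
```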

[1] Parallel Model Training Without Compromising Convergence, N. Ioannou, C. Dünner, K. Kourtis, T. Parnell. MLSys workshop (2018) → oral, Fri. 7th December
[2] Tera-Scale Coordinate Descent on GPUs, T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis. FGCS (2018)
[3] Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems, C. Dünner, T. Parnell, M. Jaggi. NIPS (2017)
[4] CoCoA: A General Framework for Communication-Efficient Distributed Optimization, V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018)

Framework    | Models   | GPU Acceleration | Distributed Training | Sparse Data Support
Scikit-learn | ML\{DL}  | No               | No                   | Yes
Spark MLlib  | ML\{DL}  | No               | Yes                  | Yes
TensorFlow   | ML       | Yes              | Yes                  | Limited
Snap ML      | GLMs     | Yes              | Yes                  | Yes

Tera-Scale Benchmark
[Figure: benchmark cluster of four worker nodes, each with two CPUs and four GPUs, connected by a 10 Gbit/s network.]

Hierarchical Optimization Framework
The Snap ML Framework

Supported models (GLMs): e.g. SVM, Lasso, Ridge Regression, Logistic Regression

Take advantage of non-uniform interconnects.

CPU solver: Parallel primal/dual coordinate descent solver [1]
GPU solver: Twice Parallel Asynchronous Coordinate Descent [2]
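To make the coordinate-descent building block concrete, here is a minimal single-threaded sketch of the per-coordinate update that both solvers parallelize, shown for ridge regression; this is an illustration only, not the Snap ML CPU or GPU kernel.

```python
# Hedged sketch of coordinate descent for (1/2)||Ax - b||^2 + (lam/2)||x||^2.
import numpy as np

def ridge_coordinate_descent(A, b, lam=1.0, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    r = b - A @ x                       # residual, maintained incrementally
    col_sq = (A * A).sum(axis=0)        # precomputed ||a_j||^2 per column
    for _ in range(epochs):
        for j in rng.permutation(d):    # random coordinate order (cf. CPU-side RNG)
            a_j = A[:, j]
            r += a_j * x[j]             # remove coordinate j's contribution
            x[j] = a_j @ r / (col_sq[j] + lam)   # closed-form 1-D minimizer
            r -= a_j * x[j]             # add the updated contribution back
    return x
```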

Performance of Snap ML in comparison to other frameworks and previously published results for training a logistic regression classifier on the Terabyte Click-Logs dataset.

Data: 4.5 billion examples, 1 million features, ~3 TB

In Snap ML, the user describes the application through high-level Python APIs for both single-node and multi-node settings. https://www.zurich.ibm.com/snapml/
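A sketch of what this scikit-learn-style API looks like for the single-node case; the module name and the use_gpu / device_ids parameters are written from the poster annotations and may differ from the released package, and the file name is hypothetical.

```python
# Hedged sketch of the high-level Python API; parameter names follow the
# poster annotations ([0,1,2,3]: specify which GPUs to use) and are not
# guaranteed to match the released Snap ML package exactly.
from sklearn.datasets import load_svmlight_file
from snap_ml import LogisticRegression    # assumed single-node module name

X, y = load_svmlight_file("criteo_kaggle_train.svm")   # hypothetical path

clf = LogisticRegression(use_gpu=True,             # enable GPU acceleration
                         device_ids=[0, 1, 2, 3])  # specify which GPUs to use
clf.fit(X, y)
scores = clf.predict(X)
```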

Single-Node Performance

task: logistic regression
dataset: criteo-kaggle (45 million examples)
infrastructure: Power AC922 server with V100 GPU


Level 1: distribution across nodes in a cluster

Level 2: distribution across heterogeneous compute units

Level 3: distribution across cores/threads
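A small sketch of how such a three-level split of the coordinates could look; the counts (4 nodes, 4 compute units per node, 8 threads per unit) are illustrative assumptions, not fixed by Snap ML.

```python
# Hedged sketch: recursively split the coordinate indices into disjoint blocks
# for level 1 (nodes), level 2 (compute units), and level 3 (threads).
import numpy as np

def hierarchical_partition(n_coordinates, n_nodes=4, units_per_node=4, threads_per_unit=8):
    plan = {}
    coords = np.arange(n_coordinates)
    for node_id, node_block in enumerate(np.array_split(coords, n_nodes)):
        for unit_id, unit_block in enumerate(np.array_split(node_block, units_per_node)):
            for thr_id, thr_block in enumerate(np.array_split(unit_block, threads_per_unit)):
                plan[(node_id, unit_id, thr_id)] = thr_block
    return plan

# e.g. plan[(0, 1, 2)] holds the coordinates owned by thread 2 of unit 1 on node 0
plan = hierarchical_partition(1_000_000)
```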

Consider Algorithm 1 applied to an objective of the form f(Aα) + Σ_i g_i(α_i), where the local subproblems are solved with relative accuracy θ in each iteration. Let f be β-smooth and convex and let the g_i be general convex functions. Then, after t₁ outer iterations with T₂ inner iterations each, the suboptimality is bounded by a quantity that decreases sublinearly in t₁.

Furthermore, if the g_i are μ-strongly convex, this improves to a linear rate.

[0,1,2,3]: specify which GPUs to use

The data is split into disjoint partitions; data-local subtasks are defined by recursively applying a block-separable upper bound to the global objective, similar to [4].
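For concreteness, a sketch of one such data-local subtask in the notation of [4], assuming the objective min_α f(Aα) + Σ_i g_i(α_i), with A the data matrix, A_[k] its column block for partition P_k, v = Aα the shared state, and σ' a safe aggregation parameter (e.g. the number of partitions); applying the same bound again inside each partition yields the lower levels of the hierarchy.

$$
\min_{\Delta\alpha_{[k]}}\;\; \nabla f(v)^\top A_{[k]}\,\Delta\alpha_{[k]}
\;+\; \frac{\sigma'\beta}{2}\,\bigl\|A_{[k]}\,\Delta\alpha_{[k]}\bigr\|^2
\;+\; \sum_{i \in \mathcal{P}_k} g_i\bigl(\alpha_i + \Delta\alpha_i\bigr)
$$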


task: logistic regression
dataset: criteo-terabyte (1 billion examples)
infrastructure: 4x Power AC922 servers with 4x V100 GPUs each

*IBM Research – Zurich, Switzerland
+IBM Systems – Bangalore, India

[Figure: Tera-scale benchmark. Test LogLoss vs. training time (minutes, log scale) for LIBLINEAR [1 core], Vowpal Wabbit [12 cores], Spark MLlib [512 cores], TensorFlow [60 worker machines, 29 parameter machines], TensorFlow on Spark [12 executors], TensorFlow [16 V100 GPUs], and Snap ML [16 V100 GPUs].]

[Figure: Single-node performance. Test LogLoss vs. training time (seconds, log scale) for TensorFlow, Sklearn (LIBLINEAR), and Snap ML.]

[Figure: Time to suboptimality (seconds) vs. number of inner iterations T₂, for a fast network (InfiniBand) and a slow network (1 Gbit Ethernet).]

node1,node2,node3,node4: specify which nodes to use

black: previously published results; orange: run on our hardware (4x IBM Power9 servers with 4x NVIDIA V100 GPUs each)

Trade-off between the parameters θ and T₂
