
Page 1: Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization (on-demand.gputechconf.com/gtc/2017/presentation/S7320…)

Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization

Presenters:
• Tim Kaldewey – Performance Architect, Watson Group
• Michael Gschwind – Chief Engineer ML & DL, Systems Group
• David K. Tam – Performance Analyst, Watson Health Group

© 2017 International Business Machines Corporation

Page 2

Disclaimer

The author's views expressed in this presentation do not necessarily reflect the views of IBM.

Page 3

DL @ scale

Page 4

DL Scaling

[Charts] Left: training time (hours) vs. #GPUs – roughly 92 h on 1 GPU, 48 h on 2, 32 h on 3, 25 h on 4. Right: sync time as a share of total time (0–80%) vs. batch size, comparing communication periods 4 and 2 (communication overhead, 16 MPI workers).

Page 5

DL $

$$,$$$ per month

Page 6

Efficiency

GPU                     K10    K20    K20X   K40    K80
Efficiency [GFLOPS/W]   20.4   15.6   16.82  21.3   29.1

Page 7

Efficiency

(GPU efficiency table repeated from the previous slide.)

Are DL workloads making efficient use of compute resources?

Page 8

A Workload:

Page 9

A Workload:

Training setup
• Multi-layer CNN for language classification
• 100s of intents, 1000s of sentences
• Small batch sizes
• Training 200 epochs
• Torch-based implementation using a single GPU*

Hardware config*
• IBM S822LC
• 20 cores @ 3.5 GHz
• 512 GB RAM
• 2 x K80 GPUs
• 4 VMs with 5 cores, 120 GB mem, 1 GPU each

*for more details on hardware/software configuration(s) please refer to S7368 – PowerAI: A Co-Optimized Software Stack for AI on Power

Page 10

System utilization as a measure of efficiency

[Chart] Utilization (%) over time (roughly 5–1255 seconds) for one training run:
• 1 GPU chip, reported by nvidia-smi
• 5 POWER8 cores, reported by mpstat

Page 11

GPU computation as a measure of efficiency

Kernel                              Invocations  Metric         Metric Description                            Min      Max       Avg
void indexSelectLargeIndex          737          flop_count_sp  Floating Point Operations (Single Precision)  1700     6300      2343
sgemm_sm35_ldg_nt_128x8x128x16x16   4916         flop_count_sp  Floating Point Operations (Single Precision)  6537216  23052288  8080324
void THCudaTensor_pointwis…         2459         flop_count_sp  Floating Point Operations (Single Precision)  3847422  3847422   3847422
...

• nvprof reports the #FLOPs

Page 12

GPU computation as a measure of efficiency

(nvprof kernel table repeated from the previous slide.)

• nvprof reports the #FLOPs
• Or, knowing the network layout, #FLOP can be calculated …

➔ 670 billion FLOP in 5.6 seconds (1 epoch) ≈ 120 GFLOPS
   vs. 4.4 TFLOPS peak on a single K80 chip
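As a sanity check, the achieved-vs-peak arithmetic above can be reproduced directly; this minimal sketch uses only the figures quoted on the slide.

```python
# Achieved single-precision throughput for one epoch, per the slide.
total_flop = 670e9      # 670 billion FLOP per epoch
epoch_seconds = 5.6     # measured duration of one epoch

achieved_gflops = total_flop / epoch_seconds / 1e9
peak_gflops = 4400.0    # 4.4 TFLOPS peak on a single K80 chip

utilization = achieved_gflops / peak_gflops
print(round(achieved_gflops))    # ~120 GFLOPS
print(f"{utilization:.1%}")      # ~2.7% of peak
```

At roughly 3% of peak, a single training job leaves most of the K80's compute idle, which is what motivates the GPU-sharing experiments that follow.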

Page 13

Runtime in CUDA libraries as a measure of efficiency

On S822LC (K80)
• Training ~20 mins for 200 epochs
• 56% spent on GPU computation
• Runtime breakdown: libcuda 50.4%, luajit 11.8%, libpthread 11.0%, libcudart 3.9%, libc 3.5%, libcublas 2.1%, libTHC 1.9%, libluaT 1.1%, other 14.4%

On S822LC for HPC (P100)
• Training ~12 mins for 200 epochs
• 38% spent on GPU computation
• Runtime breakdown: libcuda 35.5%, luajit 22.6%, libpthread 14.5%, libTHC 7.3%, libc 5.3%, libcublas 2.5%, libluaT 1.9%, libTHCUNN 1.3%, libcutorch 1.2%, libTH 1.2%, other 6.7%
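The "% spent on GPU computation" figures appear consistent with summing the CUDA-related library shares from the profiles above; treating {libcuda, libcudart, libcublas} as the GPU-side set is our assumption, not stated on the slide.

```python
# Library runtime shares (%) from the two profiles above.
k80 = {"libcuda": 50.4, "luajit": 11.8, "libpthread": 11.0, "libcudart": 3.9,
       "libc": 3.5, "libcublas": 2.1, "libTHC": 1.9, "libluaT": 1.1, "other": 14.4}
p100 = {"libcuda": 35.5, "luajit": 22.6, "libpthread": 14.5, "libTHC": 7.3,
        "libc": 5.3, "libcublas": 2.5, "libluaT": 1.9, "libTHCUNN": 1.3,
        "libcutorch": 1.2, "libTH": 1.2, "other": 6.7}

# Assumed proxy for "GPU computation": time spent inside CUDA libraries.
cuda_libs = {"libcuda", "libcudart", "libcublas"}

def gpu_share(profile):
    """Sum the shares of the CUDA-related libraries in a runtime profile."""
    return round(sum(v for k, v in profile.items() if k in cuda_libs), 1)

print(gpu_share(k80))   # 56.4 -- reported as 56%
print(gpu_share(p100))  # 38.0 -- reported as 38%
```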

Page 14

Increasing GPU utilization

• Better exploitation by a single program
• Alternative frameworks*
• GPU resource sharing
• Concurrent tasks and virtualization

*S. Gupta et al., “Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study”, IEEE International Conference on Data Mining 2016 (ICDM 2016)

Page 15

GPU Resource Sharing Options

• Assignment of GPUs to VMs (PCIe Pass-Through)

• Multiple GPU Contexts (Streams)

• Multi-Process Service (MPS)

• Multi-Process Service and Docker (MPS + Docker)

Page 16

Assignment of GPUs to VMs (PCIe Pass-Through)

• Assign each physical GPU to a virtual machine (VM)
• Hypervisor driver maps the GPU directly into the assigned VM
• GPU driver runs inside the VM

+ Isolation
− Insufficient resource management granularity ⇒ inefficient GPU utilization

[Diagram] Two VMs, each with its own OS, GPU driver, and applications; KVM maps one GPU into each VM through a PCIe pass-through driver.

Page 17

Multiple GPU Contexts (Streams)

• Separate GPU execution contexts for compute kernels
• Improved parallelism between streams
• Up to 32 streams

+ True parallel execution
− Limited to execution within a single process

Page 18

Multi-Process Service (MPS)

• MPS server implements a full "virtual" CUDA interface for multiple client processes

+ Multiple processes share the GPU
− Non-preemptive time-shared scheduling
− No isolation between data

[Diagram] Two OS processes, each an MPS client with its own CUDA context, connect to a single MPS server process, which owns the GPU.

Page 19

Multi-Process Service and Docker (MPS + Docker)

• MPS server implements a full "virtual" CUDA interface for multiple Docker containers

+ Multiple containers/VMs share the GPU
− Non-preemptive time-shared scheduling between containers
− No isolation between data

[Diagram] Two Docker containers, each an MPS client with its own CUDA context, connect to a single MPS server process, which owns the GPU.

Page 20

Our Solution: Multiple MPI Clients per MPS Instance

• Use MPI to scale to more processes
  – Attach multiple MPI processes to a single MPS GPU instance
• Expand GPU parallelism using existing MPI parallelism

Page 21

Using MPS with MPI

[Diagram] Left: conventional MPI – processes 0–3 over the Message Passing Interface and a communication network. Right: MPS x MPI – MPI processes 0–7 mapped in pairs onto four MPS instances: ranks 0–1 → mps0, 2–3 → mps1, 4–5 → mps2, 6–7 → mps3.

Page 22

MPS x MPI parallelism implementation

• Start an MPS instance for each GPU (4 on our system)

• Increase the # of MPI workers

• Map (#MPI workers / #GPUs) MPI workers to each GPU – intercepted by MPS
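The mapping in the last bullet can be sketched as follows. The helper names and the /tmp directory layout are our assumptions; CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY are the standard environment variables an MPS client process reads before CUDA initialization.

```python
def mps_instance(rank: int, n_workers: int, n_gpus: int) -> int:
    """Assign (#MPI workers / #GPUs) consecutive MPI ranks to each MPS/GPU instance."""
    per_gpu = n_workers // n_gpus
    return rank // per_gpu

def mps_env(rank: int, n_workers: int, n_gpus: int) -> dict:
    """Environment a rank would export before initializing CUDA, so its CUDA
    calls go to the MPS daemon serving its assigned GPU.
    The /tmp/mpsN layout is hypothetical; any per-instance directory works."""
    gpu = mps_instance(rank, n_workers, n_gpus)
    return {
        "CUDA_MPS_PIPE_DIRECTORY": f"/tmp/mps{gpu}/pipe",
        "CUDA_MPS_LOG_DIRECTORY": f"/tmp/mps{gpu}/log",
    }

# 8 MPI workers over 4 GPUs: ranks 0-1 -> mps0, 2-3 -> mps1, 4-5 -> mps2, 6-7 -> mps3
print([mps_instance(r, 8, 4) for r in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

This reproduces the pairwise mapping shown in the page 21 diagram.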

Page 23

Evaluating GPU Sharing Efficiency

1. Multiple distinct training jobs sharing a single GPU – MPS

2. A single parallel/distributed training job on (multiple) GPUs – MPS x MPI

Page 24

Experiment setup

• Training: Language classification network
• Network: Multi-layer CNN
• Implementation: Torch-based
• Training set: 100s of intents, 1000s of sentences, small batch size(s)
• Experiment: Measure time to train 200 epochs with an increasing # of training jobs

IBM Power System S822LC ("Firestone")
• POWER8 CPUs, 2 sockets
• 20 cores @ 3.5 GHz
• 512 GB RAM (DDR3)
• 2 K80 GPU cards
• PCIe gen3 CPU↔GPU

Pages 25–27: Training on Firestone (K80s) – successive builds of the chart shown in full on page 28.

Page 28

Training on Firestone (K80s)
Restricted to: 1 GPU chip, 5 CPU cores

[Chart] As the number of concurrent training jobs increases from 1 toward 10:
  Training time:    20, 27, 46, 53, 62, 83, 102 mins
  Throughput gain:  1.0x, 1.5x, 1.7x, 1.9x, 1.9x, 1.9x, 2.0x
  CPU utilization:  12%, 24%, 45%, 55%, 65%, 84%, 100%
  GPU utilization:  71%, 88%, 92%, 94%, 94%, 94%, 94%
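The throughput-gain series is consistent with computing gain as N × t(1) / t(N). The job counts below are our inference from matching the reported gains; the slide's x-axis positions are not recoverable from the transcript.

```python
# Measured training times (minutes) from the Firestone chart.
times = [20, 27, 46, 53, 62, 83, 102]
jobs  = [1, 2, 4, 5, 6, 8, 10]   # inferred x-positions (assumption)

def throughput_gain(n_jobs: int, t_n: float, t_1: float = 20.0) -> float:
    """Aggregate throughput vs. a single job: (n_jobs / t_n) / (1 / t_1)."""
    return round(n_jobs * t_1 / t_n, 1)

gains = [throughput_gain(n, t) for n, t in zip(jobs, times)]
print(gains)  # [1.0, 1.5, 1.7, 1.9, 1.9, 1.9, 2.0] -- matches the slide
```

The same formula with t(1) = 12 mins reproduces the Minsky (P100) 5-core gains on the later slides, e.g. 2.0x at 2 jobs and 2.8x at 4.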

Page 29

How About Parallel/Distributed Training?

• Training: Language classification network
• Network: Multi-layer CNN
• Implementation: Torch-based, distributed algorithm using MPI
• Training set: 10s of intents, 100,000s of sentences
• Experiment: Measure time to train 200 epochs while increasing the # of MPI workers

Page 30

Training Time with Distributed Algorithm(s) – MPS x MPI

[Chart] Training time (hours) vs. batch size; each series is one MPI-worker configuration (series labels are not recoverable from the transcript):

Batch size:  5    10   20   32   64   128  256  512
Series 1:    52   33   26   23   21   20   20   20
Series 2:    51   34   26   24   22   21   20   20
Series 3:    59   39   30   26   23   22   21   21
Series 4:    103  65   45   37   32   29   28   28
Series 5:    162  136  116  112  110  109  109  108

Page 31

How About the Latest Architecture?

IBM Power System S822LC ("Firestone")
• POWER8 CPUs, 2 sockets
• 20 cores @ 3.5 GHz
• 512 GB RAM (DDR3)
• 2 K80 GPU cards
• PCIe gen3 CPU↔GPU

IBM Power System S822LC for HPC ("Minsky")
• POWER8 CPUs, 2 sockets
• 20 cores @ 3.5 GHz
• 1 TB RAM (DDR4)
• 4 P100 GPU chips
• NVLink CPU↔GPU

Pages 32–35: Training on Minsky (P100s) – successive builds of the chart shown in full on page 36.

Page 36

Training on Minsky (P100s)
Restricted to: 1 GPU chip, 5 CPU cores

[Chart] As the number of concurrent training jobs increases from 1 toward 10:
  Training time:             12, 12, 17, 19, 24, 35, 50 mins
  Throughput gain, 5 cores:  1.0x, 2.0x, 2.8x, 3.2x, 3.0x, 2.7x, 2.4x
  Throughput gain, 10 cores: 3.3x, 3.6x, 3.8x (last three points)
  CPU utilization:           12%, 28%, 52%, 64%, 72%, 92%, 100%
  GPU utilization:           49%, 78%, 95%, 97%, 92%, 87%, 77%

Page 37

Conclusions

Pages 38–40: successive builds of the conclusions shown in full on page 41.

Page 41

Conclusions

• Optimization is a multidimensional space – need to consult multiple(!) metrics

• GPU sharing
  a) increases system efficiency by running multiple training jobs
  b) reduces training time by enabling multiple MPI workers per GPU

• Sharing GPUs yields >3x efficiency gains
  ⇒ an off-the-shelf 4-GPU workstation becomes a 12-GPU DL supercomputer

The full details of this work will appear, in final form, in the IBM Journal of Research and Development, vol. 61, no. 4/5, 2017, as part of a special issue on “Deep Learning.” Please cite the IBM Journal official paper version of record. For more information on the journal, see: http://www.research.ibm.com/journal/.