MANAGING GPU ACCELERATED COMPUTING - NVIDIA · 2019-09-12

Aug 2019 - Shankar Chandrasekaran


Page 1: MANAGING GPU ACCELERATED COMPUTING

Page 2: RISE OF GPU COMPUTING

[Figure: log-scale performance plot, 1980 to 2020, y-axis from 10^2 to 10^7]

• GPU-Computing perf: 1.5X per year, 1000X by 2025
• Single-threaded perf: 1.5X per year, then 1.1X per year

APPLICATIONS | SYSTEMS | ALGORITHMS | CUDA | ARCHITECTURE

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
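The growth rates quoted on this slide compound year over year, and a short calculation makes that concrete. The sketch below is illustrative only: the function names are made up for this example, and it simply asks how many years of steady growth at a given annual factor are needed to reach a 1000X gain. It does not reproduce the plotted data or assume any particular starting year.

```python
import math


def cumulative_gain(annual_factor: float, years: float) -> float:
    """Total gain after compounding `annual_factor` for `years` years."""
    return annual_factor ** years


def years_to_reach(target_gain: float, annual_factor: float) -> float:
    """Years of steady compounding needed to reach `target_gain`."""
    return math.log(target_gain) / math.log(annual_factor)


# Compare the two annual growth factors that appear on the slide.
for factor in (1.5, 1.1):
    years = years_to_reach(1000, factor)
    print(f"{factor}X per year: 1000X takes about {years:.1f} years")
```

At 1.5X per year, a 1000X gain takes roughly 17 years of compounding. At 1.1X per year it takes more than 70.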

Page 3: NVIDIA GPU PLATFORM FOR ACCELERATING AI

END-TO-END SOFTWARE STACK | RECORD-SETTING PERFORMANCE | AVAILABLE EVERYWHERE

• 6 MLPerf Training Records
• Cloud Services: AWS SageMaker, GCP ML Engine, AzureML
• Systems

Time Machine for AI (Training RN-50)

• 2015, TESLA K80 | CUDA: 36,000 Mins (25 Days)
• 2017, DGX-1 | VOLTA | TENSOR CORES: 480 Mins (8 Hrs)
• 2018, DGX-2 | VOLTA | NVSWITCH: 63 Mins
• 2019, DGX SUPERPOD | VOLTA | MELLANOX IB: <2 Mins
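The training times on this slide can be turned into speedup factors relative to the 2015 baseline. The following is a small sketch using only the numbers listed above. The dictionary keys and the function name are chosen for this example, and because the 2019 time is given as "<2 Mins", the value 2 is treated as an upper bound, which makes the resulting speedup a lower bound.

```python
# RN-50 training times in minutes, as listed on the slide.
# The 2019 entry is "<2 Mins"; 2 is used here as an upper bound.
TRAINING_MINUTES = {
    "2015 TESLA K80": 36_000,
    "2017 DGX-1": 480,
    "2018 DGX-2": 63,
    "2019 DGX SUPERPOD": 2,
}


def speedups(times: dict, baseline_key: str) -> dict:
    """Return each entry's speedup relative to the baseline entry."""
    base = times[baseline_key]
    return {name: base / minutes for name, minutes in times.items()}


result = speedups(TRAINING_MINUTES, "2015 TESLA K80")
for name, factor in result.items():
    print(f"{name}: {factor:,.1f}X relative to the 2015 baseline")
```

With these inputs the 2017 system comes out at 75X, the 2018 system at roughly 571X, and the 2019 system at 18,000X or more.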

Page 4: INFRASTRUCTURE WORKFLOW FOR AI

• Server (& rest of infra) Bring-up & Provisioning
• Virtualization Provisioning
• Container Orchestration
• Application Deployment & Management
• App/Infra Monitoring
• Error handling & remediation

Page 5: RUNNING AI & DATA SCIENCE JOBS WITH CONFIDENCE

Hardware Reliability & Management

NVIDIA DATA CENTER GPU MANAGER
• Active health monitoring
• Diagnostics
• Power and clock management

OUT OF BAND MANAGEMENT
• Parallel monitoring & management ecosystem for GPU, memory, NVSWITCH, baseboard components

NGC READY SERVERS
• Performance validated NVIDIA GPU servers for faster rollout in production

Workflow stages: Error handling & remediation | Server (& rest of infra) Bring-up & Provisioning

Page 6: NVIDIA VIRTUAL COMPUTE SERVER

GPU Acceleration Features for Server Virtualization

Multi-VMs per GPU (Sharing)

NVIDIA NGC (Containers)

ECC & Page Retirement

Peer-to-Peer over NVLink

Multi-vGPU per VM (Aggregate)

New Features for vComputeServer

vSphere For Management, Monitoring & Migration

Enhanced, Flexible Scheduling

Workflow stage: Virtualization Provisioning

Page 7: NVIDIA GPUs in Kubernetes

• Simplify large-scale deployments of GPU-accelerated applications: GPU support in Kubernetes using the NVIDIA device plugin
• Specify GPU attributes such as GPU type and memory requirements for deployment in heterogeneous GPU clusters
• Visualize and monitor GPU metrics and health with an integrated GPU monitoring stack of NVIDIA DCGM, Prometheus and Grafana

Stack: NVIDIA GPUs | NVIDIA Container Runtime | KUBERNETES GPU plugin | NGC Containers | Docker

Workflow stages: Container Orchestration | App/Infra Monitoring | Application Deployment & Management
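To make the device-plugin and GPU-type points above more concrete, here is a minimal sketch of a Pod manifest built in Python. It assumes the NVIDIA device plugin is installed, so that nodes advertise the `nvidia.com/gpu` extended resource. Everything else is hypothetical and chosen for illustration: the function name, the pod name, the placeholder image, and the `accelerator` node label, since each cluster defines its own labels for GPU type. GPU memory requirements are not covered here.

```python
import json


def gpu_pod_manifest(name, image, gpu_count, gpu_type_label=None):
    """Build a Pod manifest (as a dict) that requests whole NVIDIA GPUs."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }
    if gpu_type_label is not None:
        # Hypothetical node label; clusters choose their own keys and values.
        pod["spec"]["nodeSelector"] = {"accelerator": gpu_type_label}
    return pod


manifest = gpu_pod_manifest(
    name="gpu-job",
    image="example.com/my-gpu-app:latest",  # placeholder image, not a real one
    gpu_count=2,
    gpu_type_label="nvidia-tesla-v100",  # assumed label value
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML manifests. Pinning the GPU type through a node selector is one common approach in heterogeneous clusters. It is shown here as an assumption, not as the mechanism the slide refers to.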

Page 8: AI DEVELOPMENT AND DEPLOYMENT

Data scientists | Developers | IT/DevOps

• Data Preprocessing
• Labeling
• Model Development & Evaluation
• Train @scale
• Optimization
• Deployment & Monitoring

Trained Models | Apps with trained Models | New data to update models

Page 9: PLATFORM BUILT FOR TRAINING

Accelerating Every Framework And Fueling Innovation

All Major Frameworks

All Use-cases: Speech | Video | Translation | Personalization

Volta Tensor Core, NVSwitch, NVLink

Page 10: INFERENCE WITH TENSORRT INFERENCE SERVER

Front End Client Applications: App 1 | App 2

AI Model Repository

AI Inference Cluster (CPU | GPU): four TensorRT Inference Server App instances

Any framework: TensorFlow | TensorRT Plan | PyTorch | Caffe | Custom

Any platform: Cloud | Data center, GPU | CPU

Page 11: NVIDIA NGC – 150+ CONTAINERS, PRE-TRAINED MODELS, TRAINING SCRIPTS AND WORKFLOWS

©2018 VMware, Inc.

Page 12: KEY TAKEAWAYS

GPU Platform Checklist

• NGC Ready servers, DCGM

• vComputeServer for vSphere environments

• Kubernetes for container orchestration on GPUs

• AI Training: GPU-optimized software for model development and training

• AI Inference: TensorRT Inference Server or vComputeServer fractional GPUs

www.nvidia.com | ngc.nvidia.com
