Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU

Mohan Potheri, VMware, Inc

Justin Murray, VMware, Inc

©2018 VMware, Inc.


Post on 25-Jul-2020


TRANSCRIPT

Page 1: Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU

Mohan Potheri, VMware, Inc

Justin Murray, VMware, Inc

Page 2: Agenda

New Demands on IT

VMware Goal and Approach

Why Virtualize AI & ML

Machine Learning Landscape

Maximizing GPU Utilization

Extending GPU Sharing to Containers

Summary

Page 3: New Demands on IT Infrastructure

• Specialized Hardware: x86, SGX, GPU, NVM, FPGA, QAT, IPU, PMEM

• Security

• Hybrid Cloud

• Public Cloud

• Global Infra and Edge

• Growth of Apps: Business Critical Apps, Desktop Virtualization, Graphic Intensive, Cloud-Native Apps, Edge/IoT, SaaS, Mobile, Custom/Other, Analytics/AI/ML

Page 4: Our Goal and Approach

• Increase agility and decrease time to discovery for researchers, data scientists, and engineers

• Provide IT with the ability to efficiently provision, allocate, manage and ensure compliance of research compute infrastructure across an increasingly broad range of technical and business requirements

• By leveraging VMware’s proven, enterprise-class virtualization and cloud technologies to meet the performance requirements of research computing, HPC, and ML workloads, and

• Bringing novel capabilities to bear that are not available in traditional HPC/ML environments

Page 5: Why Virtualize HPC AI/ML Infrastructure

vSphere can help data scientists get to answers faster

Operational Flexibility

• Simple cluster expansion and contraction

• Rapidly reproduce research environments

• Higher resiliency and less downtime with vMotion

• Fault isolation (hardware and software)

• Cluster resource sharing

Reduced Complexity

• Minimize setup and configuration time with centralized management capabilities

• Simultaneously support mixed software environments

• Industry-leading virtualization platform that your IT already knows

Secure Sensitive Workloads

• Easy, secure data access and sharing

• Security isolation

• Multi-tenant data security

Page 6: Dispelling the Misunderstanding about GPUs on vSphere

• Hypervisor is not an intermediary when accessing the GPU

• GPU access is either:

• Directly via passthrough to the VM, or

• Through NVIDIA GRID vGPU

• Near-zero performance impact

Page 7: Machine Learning Infrastructure Landscape

[Diagram labels: Big Data, Data Analytics, Machine Learning, Deep Learning, VDI, Edge or IoT, on-prem, off-prem, training data, inference]

Two Main Phases in ML

• Training / Model Building

• Often very large data sets

• Compute, storage, and network intensive

• Server-class infrastructure

• Inference / Scoring

• Apply existing models to new data

• Used for prediction

• Edge or core infrastructure
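The split between the two phases can be illustrated with a deliberately tiny model. The sketch below is not from the deck: it assumes a one-variable least-squares fit as a stand-in for training, and the function names and data are hypothetical.

```python
def train(xs, ys):
    """Training / model building: fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x):
    """Inference / scoring: apply an existing model to new data."""
    slope, intercept = model
    return slope * x + intercept


# Training sees the whole (here, toy) data set once.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Inference only needs the fitted parameters and one new input.
prediction = infer(model, 10.0)
print(prediction)
```

Training touches every data point, which is why it lands on server-class infrastructure; inference needs only the small fitted model, so it can run at the edge or in the core.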

Page 8: Using GPUs with vSphere

Page 9: VM DirectPath I/O for NVIDIA GPU

Page 10: A Virtualized GPU

Passthrough, vSphere 6.5/6.7

[Diagram labels: ESXi Host, GPU, VM, Linux, CUDA Library & Driver, TensorFlow]
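Because the guest runs an ordinary NVIDIA driver stack, a passed-through GPU can be checked from inside the VM the same way as on bare metal. The sketch below is an illustration rather than part of the deck: it assumes the NVIDIA driver and its `nvidia-smi` tool are installed in the guest, and the helper names are hypothetical.

```python
import shutil
import subprocess

# nvidia-smi query that prints one "name, memory" CSV row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_list(text):
    """Turn nvidia-smi CSV rows into a list of dicts (hypothetical helper)."""
    gpus = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so GPU names containing commas stay intact.
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "memory_total": memory})
    return gpus


def list_gpus():
    """Return the GPUs visible in this guest, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu["name"], gpu["memory_total"])
```

An empty result inside a VM that should have a GPU usually points at the driver or the passthrough configuration rather than at the application framework.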

Page 11: GPU Acceleration on vSphere with DirectPath I/O

• Can provision VMs with one or more GPUs

• Easily reuse GPU infrastructure

• Same behavior as public cloud GPU instances

• Benefits:

• HW isolation

• Workload isolation

• VM-level quality of service

• Fast environment provisioning

• Near bare-metal performance

• Passthrough device certification for vSphere not required

• Server must be compatible with the device, as published by the server OEM and GPU vendor

• Server must be vSphere certified

• Caveats:

• No vMotion

• No Suspend and Resume

• No DRS

• No vSphere HA

Page 12: VM DirectPath I/O – Multiple GPUs Attached to a Virtual Machine

Page 13: vSphere GPU Sharing Mechanisms

Page 14: Using GPUs with vSphere

Page 15: VMware vSphere 6.7 and NVIDIA Quadro vDWS (GRID 7.0)

• Share a single GPU among multiple VMs

• Provision VMs with anything from a partial GPU up to one full GPU

• GRID vGPU VM Suspend and Resume support

• Quickly repurpose GPU infrastructure

• VDI or data science by day

• Compute (ML) by night

• Benefits:

• HW isolation

• Workload isolation

• VM-level quality of service

• GPU quality of service

• Fast environment provisioning

• Bare-metal comparable performance

Page 16: NVIDIA GRID – Two Layers of Software/Drivers

Page 17: NVIDIA GRID Configuration – Choosing the vGPU Profile
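Choosing a profile is largely a sizing question: a larger profile gives each VM more GPU memory but leaves room for fewer VMs on the same physical GPU. The sketch below shows that trade-off as plain arithmetic. It assumes each profile reserves a fixed amount of frame buffer per VM; the 16 GB GPU and the profile sizes are hypothetical values, not figures from the deck.

```python
def vms_per_gpu(gpu_memory_gb, profile_memory_gb):
    """How many VMs with a given vGPU profile fit on one physical GPU.

    Assumption: each profile reserves a fixed slice of GPU memory per VM.
    """
    if profile_memory_gb <= 0 or profile_memory_gb > gpu_memory_gb:
        raise ValueError("profile must be positive and no larger than the GPU")
    return gpu_memory_gb // profile_memory_gb


GPU_MEMORY_GB = 16  # hypothetical physical GPU

for profile_gb in (2, 4, 8, 16):
    count = vms_per_gpu(GPU_MEMORY_GB, profile_gb)
    print(f"{profile_gb} GB profile: up to {count} VM(s) per GPU")
```

The largest profile corresponds to one VM owning the full GPU, which matches the "partial up to one full GPU" range described for vGPU.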

Page 18: Using GPUs with vSphere

Page 19: Bitfusion Enables Remote GPU Sharing

• Dynamic GPU attach anywhere

• Fractional GPUs for efficiency

• Application run-time virtualization

• Standards-based GPU

[Diagram: Bitfusion (BF) client VMs on ESX hosts connect to a vSphere GPU cluster of BF server VMs, each on an ESX host with GPU passthrough]
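To make "fractional GPUs" concrete, the sketch below places fractional requests from client VMs onto a small pool of remote GPUs using a simple first-fit rule. This is an illustration only: it is not Bitfusion's scheduler, and the function name, client names, and request sizes are hypothetical.

```python
from fractions import Fraction


def first_fit(requests, gpu_count):
    """Place fractional GPU requests on the first GPU with enough free capacity.

    requests: list of (client_name, fraction_string) pairs, e.g. ("vm-a", "1/2").
    Returns (placement, free), where placement maps a client to a GPU index
    (or None if nothing fits) and free lists the remaining capacity per GPU.
    """
    free = [Fraction(1)] * gpu_count
    placement = {}
    for client, share in requests:
        share = Fraction(share)
        for gpu_index, capacity in enumerate(free):
            if share <= capacity:
                free[gpu_index] = capacity - share
                placement[client] = gpu_index
                break
        else:
            placement[client] = None  # no GPU has enough free capacity
    return placement, free


requests = [("vm-a", "1/2"), ("vm-b", "1/2"), ("vm-c", "1/4"), ("vm-d", "1")]
placement, free = first_fit(requests, gpu_count=3)
print(placement)
print([str(f) for f in free])
```

Two half-GPU clients end up sharing one device while a full-GPU client still gets a whole one, which is the utilization argument behind fractional allocation.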

Page 20: Maximize GPU Utilization

Page 21: vSphere 6.7 GPU Virtual Machine Suspend and Resume

Source: Enhancing Operations for NVIDIA Grid

Video demo: https://youtu.be/PwVReRauY50

Blog article: https://blogs.vmware.com/vsphere/2018/07/vsphere-6-7-suspend-and-resume-of-gpu-attached-virtual-machines.html

Page 22: Deep Learning Virtualization Use Case: Cycle Harvesting

Challenge:

Data scientists submit jobs in traditional batches because of limited compute availability

• Submit jobs one day

• Wait until the next day for the job results

What if…

The VDI environment has unused cycles. Could HPC jobs run in that environment when it is not needed for VDI?

Will it blend?

Outcome

Enable HPC compute jobs to harvest cycles from a VDI compute environment.

Benefit

Go beyond traditional batch processing to viewing HPC resources as an engine for returning results in real time.

Page 23: Cycle Harvesting

[Diagram: VMs on three VMware ESXi hosts along a time axis (8 AM, noon, 5 PM, 10 PM), each VM labeled with a share value of 100 or 1]
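The share values in the diagram (100 and 1) can be read as relative weights that only matter when VMs compete for the same resource. The sketch below models that idea with a simple proportional split. It is a simplified model for illustration, not vSphere's scheduler, and the VM names are hypothetical.

```python
def entitlement(shares, active):
    """Split a contended resource among active VMs in proportion to their shares.

    shares: dict mapping VM name to its share value.
    active: set of VM names that currently demand the resource.
    Idle VMs get 0.0; active VMs split the resource by relative share.
    """
    total = sum(shares[vm] for vm in active)
    return {vm: (shares[vm] / total if vm in active else 0.0) for vm in shares}


shares = {"vdi-1": 100, "vdi-2": 100, "hpc-1": 1}

# Working hours: VDI desktops are busy, so the low-share HPC VM gets very little.
day = entitlement(shares, active={"vdi-1", "vdi-2", "hpc-1"})

# After hours: VDI desktops are idle, so the HPC VM can use everything.
night = entitlement(shares, active={"hpc-1"})

print(day)
print(night)
```

With a share value of 1 against 100, the HPC VM barely affects busy desktops during the day, yet it harvests the whole host once the desktops go idle.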

Page 24: Cycle Harvesting Case Study

https://bit.ly/2MrBngH

Page 25: Extending GPGPU Sharing to Containers

Page 26: Why Singularity Containers?

Docker is not designed for HPC architectures

Singularity is the best-suited container solution for HPC:

• A Singularity container is encapsulated in a single file, making it highly portable and secure.

• Singularity is designed from the ground up for scientific computing.

Page 27: Combining Virtual Machines & Containers for GPU Sharing

• Sharing GPUs in a container is difficult as there is no resource management

• A vSphere VM with NVIDIA GRID or Bitfusion can use a whole or partial GPU

• Containers are a great packaging mechanism for applications

• By enclosing one container per virtual machine, we get the best of both worlds

• GPU resources can be shared with other containers

• Machine and Deep Learning applications & platforms can be packaged and distributed effectively as a container

Page 28: Logical Schematic of Infrastructure Components

• One Singularity Container per VM

• Containers leverage partial or full GPUs allocated to the virtual machine

• Container packaged with TensorFlow, tools, etc.

• Bitfusion provides GPU sharing

[Diagram: a vSphere generic cluster of ESX hosts, each running a virtual machine with one Singularity container, alongside a vSphere GPU cluster of BF server VMs, each on an ESX host with GPU passthrough]
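For the one-container-per-VM layout, launching the workload comes down to a single container command inside the VM. The sketch below builds such a command with Singularity's `--nv` option, which exposes the host's NVIDIA driver and devices to the container. It assumes the VM sees an NVIDIA GPU locally (for example through passthrough or vGPU); the image name `tensorflow.sif` and the script `train.py` are hypothetical, and a Bitfusion client setup would wrap the workload differently.

```python
import shlex


def singularity_gpu_command(image, workload):
    """Build a `singularity exec` command that enables NVIDIA GPU support.

    image: path to the container image file (hypothetical name in this sketch).
    workload: the command to run inside the container, as a list of arguments.
    """
    return ["singularity", "exec", "--nv", image] + list(workload)


cmd = singularity_gpu_command("tensorflow.sif", ["python", "train.py"])

# Print a copy-pasteable shell form; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as an argument list avoids shell quoting problems if the image path or script arguments contain spaces.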

Page 29: Images/sec Throughput Comparison for 1 GPU

2.5-3X more throughput with sharing

[Chart: Throughput comparison with and without GPU sharing. Y-axis: throughput ratios (0.00 to 3.50). X-axis: Resnet50, Alexnet, Inception3. Series: total throughput, baseline (no sharing).]

Page 30: Runtime Comparison for 1 GPU (with/without sharing)

[Chart: Runtime comparison for 1 GPU with and without sharing. Axis: runtime (%) and average run time in seconds (0.00 to 200.00). Series: unshared, shared. Annotation: 17%.]

Only 17% slower for nearly 3X throughput
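One way to relate the two figures is a simple model in which several jobs share the GPU concurrently and each job runs somewhat slower than it would alone. The sketch below works through that arithmetic. The 17% slowdown comes from the slide, but the number of concurrent jobs is an assumption made here for illustration; the deck does not state it.

```python
def aggregate_throughput_ratio(concurrent_jobs, slowdown):
    """Throughput of a shared GPU relative to running one job at a time.

    concurrent_jobs: jobs sharing the GPU at once (assumed, not from the deck).
    slowdown: per-job runtime relative to an unshared run, e.g. 1.17 for 17% slower.
    """
    if concurrent_jobs < 1 or slowdown <= 0:
        raise ValueError("need at least one job and a positive slowdown")
    return concurrent_jobs / slowdown


# With three jobs sharing one GPU, each 17% slower than when run alone:
ratio = aggregate_throughput_ratio(concurrent_jobs=3, slowdown=1.17)
print(f"{ratio:.2f}x the unshared throughput")
```

Under this model a modest per-job slowdown is outweighed by running several jobs at once, which is the trade-off the chart summarizes.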

Page 31: Summary

• Sharing is key to enabling cloud-like capabilities on premises

• vSphere is the best platform to leverage the latest high-performance hardware

• Virtualization supports device sharing and delivers near bare-metal performance

• HW sharing through vSphere can increase utilization (cycle harvesting)