
Page 1:

rCUDA 4: GPGPU as a service in HPC clusters

Rafael Mayo
Antonio J. Peña

Universitat Jaume I, Castelló (Spain)

Joint research effort

Page 2:

Outline

GPU computing
GPU computing scenarios
The wish list
Introduction to rCUDA
rCUDA functionality
rCUDA v3 (stable)
rCUDA v4 (alpha)
Current status

HPC Advisory Council Spain Conference 2012 2/72


Page 4:

GPU computing

GPU computing: covers all the technological issues involved in using the GPU computational power to execute general-purpose code

GPU computing has experienced remarkable growth in the last years

Page 5:

GPU computing

The basic building block is a node with 1 or more GPUs

[Figure: a node — one or more CPUs sharing the main memory, GPUs each with their own memory attached through PCI-e, and a network interface]

Page 6:

GPU computing

From the programming point of view, a set of nodes, each one with:
one or more CPUs (with several cores per CPU)
one or more GPUs (1-4)
An interconnection network

[Figure: several such nodes linked by an interconnection network]

Page 7:

GPU computing

Development tools have been introduced in order to ease the programming of GPUs

Two main approaches in GPU computing development environments:
CUDA: NVIDIA proprietary
OpenCL: open standard

Page 8:

GPU computing

Basically, CUDA and OpenCL have the same working scheme:

Compilation: separate CPU code from GPU code (the GPU kernel)

Execution: data transfers between the CPU and GPU memory spaces
1. Before GPU kernel execution: data from the CPU memory space to the GPU memory space
2. Computation: kernel execution
3. After GPU kernel execution: results from the GPU memory space to the CPU memory space
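The three-step flow above can be sketched from the host's point of view. The snippet below is a toy model, not real CUDA code: plain malloc/memcpy stand in for cudaMalloc/cudaMemcpy, a simple loop plays the role of the GPU kernel, and the function names are invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the CUDA/OpenCL working scheme (not real CUDA code):
 * malloc/memcpy stand in for cudaMalloc/cudaMemcpy, and a loop stands
 * in for the GPU kernel. */
static void fake_kernel(float *d_buf, int n) {
    for (int i = 0; i < n; ++i)       /* 2. computation: the "kernel" runs */
        d_buf[i] *= 2.0f;             /*    on the device copy of the data */
}

int run_gpu_job(const float *host_in, float *host_out, int n) {
    float *d_buf = malloc(n * sizeof *d_buf);     /* cudaMalloc            */
    if (!d_buf) return -1;
    memcpy(d_buf, host_in, n * sizeof *d_buf);    /* 1. CPU -> GPU copy    */
    fake_kernel(d_buf, n);                        /* 2. kernel execution   */
    memcpy(host_out, d_buf, n * sizeof *d_buf);   /* 3. GPU -> CPU copy    */
    free(d_buf);                                  /* cudaFree              */
    return 0;
}
```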

Page 9:

GPU computing

Time spent on data transfers may not be negligible

[Figure: influence of data transfers for SGEMM — percentage of time devoted to data transfers vs matrix size (0-18000), for pinned and non-pinned memory]
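A back-of-the-envelope model explains the shape of such a curve. Assuming three n x n single-precision matrices cross the bus at a given GB/s rate and the kernel performs 2n^3 flops at a given GFLOP/s rate (both numbers are illustrative, not measured), the fraction of time spent on transfers falls as n grows, because compute grows as n^3 while traffic grows only as n^2:

```c
/* Fraction of total time spent moving data for a GEMM of size n,
 * under illustrative assumptions: 3 matrices of n*n floats move over
 * the bus at bw_gbs GB/s, and the kernel sustains gflops GFLOP/s
 * over 2*n^3 floating-point operations. */
double transfer_fraction(double n, double bw_gbs, double gflops) {
    double t_xfer = (3.0 * n * n * 4.0) / (bw_gbs * 1e9);   /* bus time    */
    double t_kern = (2.0 * n * n * n) / (gflops * 1e9);     /* kernel time */
    return t_xfer / (t_xfer + t_kern);
}
```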


Page 11:

GPU computing scenarios

For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy

There must be data parallelism in the code: this is the only way to take benefit from the hundreds of processors in a GPU

Different scenarios from the point of view of the application:
Low level of data parallelism
High level of data parallelism
Moderate level of data parallelism
Applications for multi-GPU computing

Page 12:

GPU computing scenarios

Low level of data parallelism
Regarding GPU computing?
No GPU is needed; just proceed with the traditional HPC strategies

High level of data parallelism
Regarding GPU computing?
Add one or more GPUs to every node in the system and rewrite applications to use them

Page 13:

GPU computing scenarios

Moderate level of data parallelism: the application presents a data parallelism around 40%-80%
Regarding GPU computing?
The GPUs in the system are used only for some parts of the application, remaining idle the rest of the time and, thus, wasting resources and energy

Page 14:

GPU computing scenarios

Applications for multi-GPU computing
An application can use a large amount of GPUs in parallel
Regarding GPU computing?
The code running in a node can only access the GPUs in that node, but it would run faster if it could have access to more GPUs


Page 16:

The wish list

Current trend in GPU deployment: each node with one or more GPUs

CONCERNS:

For applications with moderate levels of data parallelism, the GPUs in the cluster may be idle for long periods of time

Multi-GPU applications cannot make use of the tremendous GPU resources available across the cluster

Deployment and energy costs!

Page 17:

The wish list

A way of addressing the first concern, the energy concern, is by reducing the number of GPUs present in the cluster

NEW CONCERN:
A lower amount of GPUs noticeably increases the difficulty of efficiently scheduling jobs (considering global CPU and GPU requirements)

Page 18:

The wish list

A way of addressing the new scheduling problem is by reducing the number of GPUs present in the cluster and sharing the remaining ones among the CPU nodes

Additionally, by appropriately sharing GPUs, multi-GPU computing is also feasible

This would increase GPU utilization, also lowering power consumption, at the same time that initial acquisition costs are reduced
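The saving can be quantified with a simple utilization model. If each of N nodes runs a job that keeps a GPU busy only a fraction p of the time, N private GPUs each sit at utilization p, while a shared pool only needs about ceil(N*p) GPUs to serve the same aggregate load. This is an idealized sketch that ignores contention and scheduling overhead:

```c
/* Idealized sizing model for a shared GPU pool: n_nodes jobs each keep
 * a GPU busy a fraction `busy` of the time, so the pool needs only
 * enough GPUs to cover the aggregate demand (contention ignored). */
int gpus_needed_shared(int n_nodes, double busy) {
    double demand = n_nodes * busy;
    int whole = (int)demand;
    return (whole < demand) ? whole + 1 : whole;  /* ceil without math.h */
}

/* With one private GPU per node, each GPU's utilization is just `busy`. */
double private_gpu_utilization(double busy) {
    return busy;
}
```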

Page 19:

The wish list

Efficiently sharing GPUs across a cluster can be achieved by leveraging GPU virtualization

Remote GPU virtualization solutions: rCUDA, gVirtuS, vCUDA, GViM, VGPU, GridCuda

As far as we know, rCUDA is the only remote virtualization solution with CUDA 4 support

Page 20:

The wish list

Going even beyond: consolidating GPUs into dedicated servers (no CPU power) and allowing GPU task migration

TRUE GREEN GPU COMPUTING

Page 21:

The wish list

GPUs are disaggregated from CPUs and global schedulers are enhanced


Page 23:

Introduction to rCUDA

A framework enabling a CUDA-based application running in one node to access GPUs in other nodes

It is useful when you have:
Moderate level of data parallelism
Applications for multi-GPU computing

Page 24:

Introduction to rCUDA

Moderate level of data parallelism

[Figure: a cluster where every node has its own GPUs]

Adding GPUs at each node makes some GPUs remain idle for long periods. This is a waste of money and energy

Page 25:

Introduction to rCUDA

Moderate level of data parallelism

[Figure: a cluster where only some nodes have GPUs]

Add only the GPUs that are needed, considering application requirements and their level of data parallelism, and…

Page 26:

Introduction to rCUDA

Moderate level of data parallelism

[Figure: the same cluster with logical connections from every node to the GPUs]

Add only the GPUs that are needed, considering application requirements and their level of data parallelism, and make all of them accessible from every node

Page 27:

Introduction to rCUDA

Applications for multi-GPU computing

[Figure: a cluster where every node has its own GPUs]

From a given CPU it is only possible to access the corresponding GPUs in that very same node

Page 28:

Introduction to rCUDA

Applications for multi-GPU computing

[Figure: logical connections making every GPU reachable from every node]

Make all GPUs accessible from every node

Page 29:

Introduction to rCUDA

Applications for multi-GPU computing

[Figure: a single node holding logical connections to many GPUs across the cluster]

Make all GPUs accessible from every node and enable the access from a CPU to as many GPUs as required


Page 31:

rCUDA functionality

[Figure: a CUDA application running on top of the CUDA libraries]

Page 32:

rCUDA functionality

[Figure: the CUDA stack split into a client side (application, CUDA libraries) and a server side (CUDA driver + runtime)]

Page 33:

rCUDA functionality

[Figure: client side — the application runs on top of the rCUDA library; server side — the rCUDA daemon runs on top of the CUDA libraries; both sides communicate through their network devices]

Page 34:

rCUDA functionality

[Figure: on the server side, the rCUDA daemon forwards the requests to the CUDA driver + runtime]

Page 35:

rCUDA functionality

[Figure: the complete rCUDA stack — application, rCUDA library, network devices, rCUDA daemon, CUDA driver + runtime]
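The split shown in the figures can be sketched as a remote procedure call: the rCUDA library intercepts a CUDA Runtime call on the client, packs it into a message, and the daemon unpacks it and invokes the real runtime on the server. The snippet below is a minimal in-process model of that idea — the message format, handle table, and function names are invented for illustration, and plain malloc plays the part of the server's real cudaMalloc.

```c
#include <stdlib.h>

/* Minimal model of rCUDA-style call forwarding (formats invented).
 * A client stub packs the call into a message; the "daemon" unpacks
 * it and performs the real work, returning a device-memory handle. */
enum opcode { OP_MALLOC, OP_FREE };

struct message { enum opcode op; size_t size; int handle; };

#define MAX_ALLOC 16
static void *alloc_table[MAX_ALLOC];   /* server-side handle -> pointer */

/* Server side: what the rCUDA daemon would do on message arrival. */
static int daemon_dispatch(struct message *m) {
    switch (m->op) {
    case OP_MALLOC:
        for (int h = 0; h < MAX_ALLOC; ++h)
            if (!alloc_table[h]) {
                alloc_table[h] = malloc(m->size); /* stands in for cudaMalloc */
                m->handle = h;
                return alloc_table[h] ? 0 : -1;
            }
        return -1;
    case OP_FREE:
        free(alloc_table[m->handle]);             /* stands in for cudaFree */
        alloc_table[m->handle] = NULL;
        return 0;
    }
    return -1;
}

/* Client side: the stub the application links against instead of CUDA. */
int rcuda_malloc(size_t size, int *handle) {
    struct message m = { OP_MALLOC, size, -1 };
    int rc = daemon_dispatch(&m);   /* in rCUDA this crosses the network */
    *handle = m.handle;
    return rc;
}

int rcuda_free(int handle) {
    struct message m = { OP_FREE, 0, handle };
    return daemon_dispatch(&m);
}
```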

Page 36:

rCUDA functionality

rCUDA uses a proprietary communication protocol

Example:
1) initialization
2) memory allocation on the remote GPU
3) CPU to GPU memory transfer of the input data
4) kernel execution
5) GPU to CPU memory transfer of the results
6) GPU memory release
7) communication channel closing and server process finalization
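The seven-step session above implies an ordering discipline: a transfer or launch is only legal after the allocation it touches, results come after the kernel, and the channel close must come last. A toy validator for that discipline is sketched below — the opcodes and rules are invented for illustration, since the real wire protocol is proprietary.

```c
/* Toy validator for the example session of the (proprietary) protocol:
 * opcodes are invented; the rule checked is simply that allocation
 * precedes use, results come after the kernel, and CLOSE ends it. */
enum op { INIT, MALLOC, COPY_H2D, LAUNCH, COPY_D2H, FREE, CLOSE };

int session_is_valid(const enum op *ops, int n) {
    int inited = 0, alloced = 0, launched = 0, closed = 0;
    for (int i = 0; i < n; ++i) {
        if (closed) return 0;                       /* nothing after CLOSE */
        switch (ops[i]) {
        case INIT:     inited = 1; break;
        case MALLOC:   if (!inited) return 0; alloced = 1; break;
        case COPY_H2D: if (!alloced) return 0; break;
        case LAUNCH:   if (!alloced) return 0; launched = 1; break;
        case COPY_D2H: if (!launched) return 0; break;
        case FREE:     if (!alloced) return 0; alloced = 0; break;
        case CLOSE:    closed = 1; break;
        }
    }
    return closed;
}
```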


Page 38:

rCUDA v3 (stable)

Current stable release

Uses the TCP/IP stack

Runs over all TCP/IP networks

Presents some non-negligible overhead due to the use of TCP/IP

Page 39:

rCUDA v3 (stable)

[Figure: execution time for matrix-matrix multiplication (GEMM) — CPU, CUDA, and rCUDA over Gigabit Ethernet, with time split into kernel execution and data transfers; Tesla C1060, Intel Xeon E5410 (8 cores), matrix dimensions 512-16384]

J. Duato, F. D. Igual, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, and F. Silla, "An efficient implementation of GPU virtualization in high performance clusters", in Euro-Par 2009, Parallel Processing - Workshops, Lecture Notes in Computer Science 6043, pp. 385-394, Springer-Verlag, 2010.


Page 41:

rCUDA v4 (alpha): low-level InfiniBand support

InfiniBand is the most used HPC network
Low latency and high bandwidth

[Figure: interconnect share in the June Top500 list]

Page 42:

rCUDA v4 (alpha)

Same user-level functionality: CUDA Runtime
Use of IB Verbs
All the TCP/IP stack overhead is avoided
Bandwidth from the client to/from the remote GPU is near the peak InfiniBand network bandwidth
Use of GPUDirect
Reduces the number of intra-node data movements
Use of pipelined transfers
Overlaps intra-node data movements and network transfers

GPUDirect RDMA support is coming
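The pipelining idea — overlap the intra-node staging of chunk i+1 with the network send of chunk i — can be sketched as a double-buffered loop. Serial C cannot show true overlap, so the toy below only records the schedule such a pipeline would follow; the step names and chunking are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Double-buffered pipeline model: while chunk i is "on the wire",
 * chunk i+1 is being staged from host memory into the other buffer.
 * The log records the interleaving a real pipeline would achieve:
 * stage0, then send(i) overlapped with stage(i+1), then the last send. */
int pipeline_schedule(int nchunks, char *log, size_t logsz) {
    size_t used = 0;
    for (int i = -1; i < nchunks; ++i) {
        char step[32];
        int n;
        if (i < 0)
            n = snprintf(step, sizeof step, "stage%d ", 0);        /* prologue */
        else if (i + 1 < nchunks)
            n = snprintf(step, sizeof step, "send%d+stage%d ", i, i + 1);
        else
            n = snprintf(step, sizeof step, "send%d", i);          /* drain */
        if (used + n + 1 > logsz) return -1;
        memcpy(log + used, step, n + 1);
        used += n;
    }
    return 0;
}
```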

Page 43:

rCUDA v4 (alpha)

Remote GPU Transfers (GPUDirect)

[Figure: step-by-step data path of a remote GPU transfer using GPUDirect]


Page 48:

rCUDA v4 (alpha)

Remote GPU Transfers (GPUDirect RDMA) — CUDA 5, upcoming

[Figure: data path of a remote GPU transfer using GPUDirect RDMA]


Page 51:

rCUDA v4 (alpha)

[Figure: bandwidth for a 4096 x 4096 single-precision matrix — rCUDA GigaE, rCUDA IPoIB, rCUDA IBVerbs, and CUDA over QDR InfiniBand; the InfiniBand peak bandwidth is 2900 MB/s]

J. Duato, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana-Ortí, "rCUDA InfiniBand performance", poster, International Supercomputing Conference, 2011.

Page 52:

rCUDA v4 (alpha)

[Figure: execution time for a matrix-matrix product — Tcpu, Tgpu, and TrCUDA vs matrix dimension (100-14600); Tesla C2050, Intel Xeon E5645, QDR InfiniBand]

Page 53:

rCUDA v4 (alpha)

[Figure: overhead time for a matrix-matrix product — % overhead of GPU and rCUDA vs matrix dimension (100-14600); Tesla C2050, Intel Xeon E5645, QDR InfiniBand]

Page 54:

rCUDA v4 (alpha)

[Figure: execution time for the LAMMPS application, in.eam input script scaled by a factor of 5 in the three dimensions; Tesla C2050, Intel Xeon E5520, QDR InfiniBand]

C. Reaño, A. J. Peña, F. Silla, J. Duato, R. Mayo, and E. S. Quintana-Ortí, "CU2rCU: towards the complete rCUDA remote GPU virtualization and sharing solution", in Proceedings of the 2012 International Conference on High Performance Computing (HiPC 2012), Pune, India, Dec. 2012.

Page 55:

rCUDA v4 (alpha)

Multi-thread capabilities (thread-safe)

Page 56:

rCUDA v4 (alpha)

Multi-server capabilities

Page 57:

rCUDA v4 (alpha)

Multi-thread and multi-server capabilities

Page 58:

rCUDA v4 (alpha)

GPUDirect 2.0 emulation is coming

Page 59:

rCUDA v4 (alpha)

Multi-thread and multi-server performance

C. Reaño, A. J. Peña, F. Silla, J. Duato, R. Mayo, and E. S. Quintana-Ortí, "CU2rCU: towards the complete rCUDA remote GPU virtualization and sharing solution", in Proceedings of the 2012 International Conference on High Performance Computing (HiPC 2012), Pune, India, Dec. 2012.

Page 60:

rCUDA v4 (alpha)

rCUDA does not support the CUDA extensions to C

In order to execute a program within rCUDA, the CUDA extensions included in its code must be "unextended" to the plain C API

NVCC inserts calls to undocumented CUDA functions
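What the CU2rCU tool automates can be illustrated with a toy source-to-source rewrite: a triple-angle-bracket launch becomes a plain-C runtime sequence (configure the launch, set up each argument, then launch by name). The parser below is a gross simplification invented for illustration — the real tool is a full source translator, not a sscanf one-liner — and the emitted call names follow the CUDA 4-era plain runtime API.

```c
#include <stdio.h>

/* Toy model of the CU2rCU rewrite (greatly simplified): turn one
 * kernel launch written with the CUDA C <<<...>>> extension into the
 * plain C runtime calls that rCUDA can intercept. Returns 0 on success. */
int unextend_launch(const char *src, char *dst, size_t n) {
    char name[64], cfg[128], args[128];
    if (sscanf(src, " %63[^<]<<<%127[^>]>>>(%127[^)])", name, cfg, args) != 3)
        return -1;                      /* not a <<<...>>> launch */
    snprintf(dst, n,
             "cudaConfigureCall(%s);\n"
             "/* one cudaSetupArgument(...) per argument: %s */\n"
             "cudaLaunch(\"%s\");",
             cfg, args, name);
    return 0;
}
```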

Page 61:

rCUDA v4 (alpha)

CU2rCU tool

Page 62:

rCUDA v4 (alpha)

C. Reaño, A. J. Peña, F. Silla, J. Duato, R. Mayo, and E. S. Quintana-Ortí, "CU2rCU: towards the complete rCUDA remote GPU virtualization and sharing solution", in Proceedings of the 2012 International Conference on High Performance Computing (HiPC 2012), Pune, India, Dec. 2012.


Page 64:

Current Status

How compatible with CUDA is rCUDA?

Page 65:

Current Status

CUDA Runtime API functions implemented by rCUDA

rCUDA does not provide support for graphic functions

Page 66:

Current Status

NVIDIA CUDA C SDK code samples included in the rCUDA SDK

Not supported: SDK samples using graphic functions or driver API functions

Page 67:

Current Status

rCUDA v3.2 stable release
Non-pipelined TCP/IP transfers
CUDA 4.2 support:
  Only excluding graphics interoperability
Not thread-safe
Single rCUDA server per application
Only one GPU module per application allowed
CUDA C extensions unsupported
rCUDA 3.0a experimental build for MS Windows

Page 68:

Current Status

rCUDA v4 alpha features (experimental)
CUDA 4.2 (excluding graphics interoperability)
TCP + low-level InfiniBand communications
GPU & network pipelined transfers (GPUDirect)
Multithread support for applications (thread-safe)
Multiserver capabilities
Support for CUDA C extensions (CU2rCU)
Multiple GPU modules allowed
MPI integration (MVAPICH + OpenMPI)

Page 69:

Current Status

What's coming?
CUDA 5
Inter-node GPUDirect 2 emulation
GPUDirect RDMA
rCUDA integration into SLURM
Full Runtime API support:
  Graphics interoperability
rCUDA 4 for MS Windows
…

Page 70:

http://www.rcuda.net

Free binary distribution (rCUDA v3.2 + rCUDA v4.0a):
o Fedora
o Scientific Linux (RHEL compatible)
o Ubuntu
o OpenSuse
o MS Windows

Page 71:

http://www.rcuda.net

rCUDA Team:
Antonio Peña
Carlos Reaño
Federico Silla
José Duato
Adrian Castelló
Rafael Mayo
Enrique S. Quintana-Ortí

Parallel Architectures Group
High Performance Computing and Architectures Group

Thanks to Mellanox and AIC for their kind support of this work