
New Techniques for Programming GPU Clusters

Yifeng Chen

School of EECS, Peking University, China.

Two Conflicting Approaches for Programmability in HPC

Top-down Approach

- Core programming model is high-level (e.g. a functional parallel language)
- Must rely on heavy heuristic runtime optimization
- Add low-level program constructs to improve low-level control
- Risks:
  - Programmers tend to avoid using "extra" constructs.
  - Low-level controls do not fit well into the core model.

Bottom-up Approach (PARRAY, PPoPP'12)
- Core programming model exposes the memory hierarchy
- Same algorithm, same performance, same intellectual challenge, but shorter code

GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU

[Figure: a 4096-row host array is split into two 2048-row halves and distributed to Proc 0 and Proc 1 with MPI_Scatter over the network; each process then copies its half to GPU 0 / GPU 1 across PCI with cudaMemcpyHostToDevice.]
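Written out by hand, the data movement in this figure is roughly the following plain MPI + CUDA program. This is a hedged sketch using the array shape from the figure; the variable names and the error-handling-free structure are illustrative assumptions, not code from the slides, and it is the kind of boilerplate the PARRAY examples below compress into a few lines.

    /* Hand-written version of the transfer in the figure: rank 0 scatters a
     * 2 x (2048 x 4096) float array between two MPI processes, and each
     * process copies its half to its GPU over PCI.  Illustrative only. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int half = 2048 * 4096;                 /* elements per process */
        float *host = NULL;
        if (rank == 0)                                /* full array lives on rank 0 */
            host = (float *)malloc(2UL * half * sizeof(float));

        float *local = (float *)malloc((size_t)half * sizeof(float));
        MPI_Scatter(host, half, MPI_FLOAT,            /* split between Proc 0 / Proc 1 */
                    local, half, MPI_FLOAT, 0, MPI_COMM_WORLD);

        float *dev;
        cudaMalloc((void **)&dev, (size_t)half * sizeof(float));
        cudaMemcpy(dev, local, (size_t)half * sizeof(float),
                   cudaMemcpyHostToDevice);           /* host half -> GPU 0 / GPU 1 */

        cudaFree(dev);
        free(local);
        if (rank == 0) free(host);
        MPI_Finalize();
        return 0;
    }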

Motivating Examples for PARRAY

Basic Notation

Dimension Tree

Type Reference

[Figure: the same MPI_Scatter / cudaMemcpyHostToDevice data-distribution diagram as above.]

Thread Arrays

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost {
  #create P(p)
  #create H(host)
  #detour P(p) {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}

[Figure: the generated code manages the two threads with pthread_create, synchronizes them with sem_post / sem_wait, and joins them with pthread_join.]

Generating CUDA+Pthread
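A rough hand-written approximation of what the generated CUDA + Pthread code does for the example above: one POSIX thread per GPU is created and joined (the semaphore synchronization shown in the diagram is omitted here), each thread selects its device in the role of INIT_GPU($tid$) and performs the host-to-device copy that realizes DataTransfer over type G. All names and sizes are illustrative assumptions; the actual PARRAY-generated code differs.

    /* Hand-written approximation of the {pthd [2]} example: one POSIX thread
     * per GPU, each copying its half of the paged host array to device
     * memory.  Names and sizes are illustrative, not PARRAY output. */
    #include <pthread.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define HALF (2048UL * 4096UL)           /* elements per thread/GPU */

    static float *host;                      /* the [2][2048][4096] host array */

    static void *worker(void *arg) {
        long tid = (long)arg;
        cudaSetDevice((int)tid);             /* plays the role of INIT_GPU($tid$) */
        float *dev;
        cudaMalloc((void **)&dev, HALF * sizeof(float));
        /* DataTransfer(dev, G, host, H): this thread's slice of H to dmem */
        cudaMemcpy(dev, host + tid * HALF, HALF * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaFree(dev);
        return NULL;
    }

    int main(void) {
        host = (float *)malloc(2 * HALF * sizeof(float));
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        free(host);
        return 0;
    }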

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts {
  #create M(m)
  #create H(host)
  #detour M(m) {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}

Generating MPI or IB/verbs
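For the MPI variant, the generated structure roughly mirrors the scatter in the earlier hand-written sketch, except that the target type H_1 is paged host memory, so the scattered block lands directly in each process's host buffer and no cudaMemcpy is needed. The following is an illustrative assumption of that structure, not PARRAY's actual output.

    /* Hand-written sketch of the MPI variant: rank 0 owns the full
     * 2 x (2048 x 4096) host array H; MPI_Scatter delivers one
     * 2048 x 4096 block (H_1) to each of the two ranks. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int block = 2048 * 4096;               /* elements in one H_1 block */
        float *host = NULL;
        if (rank == 0)
            host = (float *)malloc(2UL * block * sizeof(float));

        float *dev = (float *)malloc((size_t)block * sizeof(float)); /* #create H_1(dev) */
        MPI_Scatter(host, block, MPI_FLOAT,          /* DataTransfer over type G */
                    dev, block, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(dev);
        if (rank == 0) free(host);
        MPI_Finalize();
        return 0;
    }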

[Figure: generated communication patterns — MPI_Scatter, ALLTOALL, BCAST.]

Other Communication Patterns

Generating Code for IB/verbs and YH

Communication Layer
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- Discontiguous RDMA communication pattern achieving zero-copy
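As an illustration of the discontiguous zero-copy idea (not the actual patched communication layer, which is considerably more involved): with IB verbs, several non-contiguous local regions of a registered buffer can be gathered into a single RDMA write by listing them as scatter/gather entries, so no intermediate packing copy is needed. The function below is a hedged sketch; queue-pair setup, memory registration, and completion handling are assumed to happen elsewhere, and all names are hypothetical.

    /* Sketch: gather `n` discontiguous local regions (all inside the
     * registered region `mr`) into one RDMA write to a contiguous remote
     * buffer, avoiding any local packing copy.  Illustrative only. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int rdma_write_gather(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *local[], uint32_t len[], int n,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge[16];            /* n must not exceed the QP's max_send_sge */
        for (int i = 0; i < n && i < 16; i++) {
            sge[i].addr   = (uintptr_t)local[i];
            sge[i].length = len[i];
            sge[i].lkey   = mr->lkey;
        }

        struct ibv_send_wr wr = {0}, *bad_wr = NULL;
        wr.wr_id               = 1;
        wr.sg_list             = sge;
        wr.num_sge             = n;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided, zero-copy write */
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }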

Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem
(Before Nov 2011)

Direct Simulation of Turbulent Flows

Scale
- Up to 14336³ (3-D, single precision)
- 12 distributed arrays, each with 11 TB of data (128 TB in total)
- Entire Tianhe-1A with 7168 nodes

Progress
- 4096³ completed; 8192³ half-way; 14336³ tested for performance

Software Technologies
- PARRAY code only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.

Generated Code

Discussions

Other programming models?
- MPI (more expressive datatype)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating Zero-Copy IB calls)

We need a software stack!
- Irregular structures must be encoded into arrays and then benefit from PARRAY.
- Runtime workflow possible above PARRAY.
- Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros.
- Macros are compiled out: no performance loss.
- Typical training = 3 days, friendly to engineers…
