
New Techniques for Programming GPU Clusters

Yifeng Chen

School of EECS, Peking University, China.

Two Conflicting Approaches for Programmability in HPC

Top-down Approach

- Core programming model is high-level (e.g. a functional parallel language)
- Must rely on heavy heuristic runtime optimization
- Add low-level program constructs to improve low-level control
- Risks:
  - Programmers tend to avoid using "extra" constructs.
  - Low-level controls do not fit well into the core model.

Bottom-up Approach (PARRAY, PPoPP'12)
- Core programming model exposes the memory hierarchy
- Same algorithm, same performance, same intellectual challenge, but shorter code

GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU

[Figure: a 4096-row host array is split into two 2048-row halves and distributed to Proc 0 and Proc 1 with MPI_Scatter over the network; each process then copies its half to GPU 0 / GPU 1 across PCI with cudaMemcpyHostToDevice.]
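Written out by hand, the data movement in this figure is roughly the following plain MPI + CUDA program. This is a hedged sketch using the array shape from the figure; the variable names and the error-handling-free structure are illustrative assumptions, not code from the slides, and it is the kind of boilerplate the PARRAY examples below compress into a few lines.

    /* Hand-written version of the transfer in the figure: rank 0 scatters a
     * 2 x (2048 x 4096) float array between two MPI processes, and each
     * process copies its half to its GPU over PCI.  Illustrative only. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int half = 2048 * 4096;                 /* elements per process */
        float *host = NULL;
        if (rank == 0)                                /* full array lives on rank 0 */
            host = (float *)malloc(2UL * half * sizeof(float));

        float *local = (float *)malloc((size_t)half * sizeof(float));
        MPI_Scatter(host, half, MPI_FLOAT,            /* split between Proc 0 / Proc 1 */
                    local, half, MPI_FLOAT, 0, MPI_COMM_WORLD);

        float *dev;
        cudaMalloc((void **)&dev, (size_t)half * sizeof(float));
        cudaMemcpy(dev, local, (size_t)half * sizeof(float),
                   cudaMemcpyHostToDevice);           /* host half -> GPU 0 / GPU 1 */

        cudaFree(dev);
        free(local);
        if (rank == 0) free(host);
        MPI_Finalize();
        return 0;
    }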

Motivating Examples for PARRAY

Basic Notation

Dimension Tree

Type Reference

[Figure: the same MPI_Scatter / cudaMemcpyHostToDevice data-distribution diagram as above.]

Thread Arrays

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost {
  #create P(p)
  #create H(host)
  #detour P(p) {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}

[Figure: the generated code manages the two threads with pthread_create, synchronizes them with sem_post / sem_wait, and joins them with pthread_join.]

Generating CUDA+Pthread
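A rough hand-written approximation of what the generated CUDA + Pthread code does for the example above: one POSIX thread per GPU is created and joined (the semaphore synchronization shown in the diagram is omitted here), each thread selects its device in the role of INIT_GPU($tid$) and performs the host-to-device copy that realizes DataTransfer over type G. All names and sizes are illustrative assumptions; the actual PARRAY-generated code differs.

    /* Hand-written approximation of the {pthd [2]} example: one POSIX thread
     * per GPU, each copying its half of the paged host array to device
     * memory.  Names and sizes are illustrative, not PARRAY output. */
    #include <pthread.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define HALF (2048UL * 4096UL)           /* elements per thread/GPU */

    static float *host;                      /* the [2][2048][4096] host array */

    static void *worker(void *arg) {
        long tid = (long)arg;
        cudaSetDevice((int)tid);             /* plays the role of INIT_GPU($tid$) */
        float *dev;
        cudaMalloc((void **)&dev, HALF * sizeof(float));
        /* DataTransfer(dev, G, host, H): this thread's slice of H to dmem */
        cudaMemcpy(dev, host + tid * HALF, HALF * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaFree(dev);
        return NULL;
    }

    int main(void) {
        host = (float *)malloc(2 * HALF * sizeof(float));
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        free(host);
        return 0;
    }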

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts {
  #create M(m)
  #create H(host)
  #detour M(m) {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}

Generating MPI or IB/verbs
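For the MPI variant, the generated structure roughly mirrors the scatter in the earlier hand-written sketch, except that the target type H_1 is paged host memory, so the scattered block lands directly in each process's host buffer and no cudaMemcpy is needed. The following is an illustrative assumption of that structure, not PARRAY's actual output.

    /* Hand-written sketch of the MPI variant: rank 0 owns the full
     * 2 x (2048 x 4096) host array H; MPI_Scatter delivers one
     * 2048 x 4096 block (H_1) to each of the two ranks. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int block = 2048 * 4096;               /* elements in one H_1 block */
        float *host = NULL;
        if (rank == 0)
            host = (float *)malloc(2UL * block * sizeof(float));

        float *dev = (float *)malloc((size_t)block * sizeof(float)); /* #create H_1(dev) */
        MPI_Scatter(host, block, MPI_FLOAT,          /* DataTransfer over type G */
                    dev, block, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(dev);
        if (rank == 0) free(host);
        MPI_Finalize();
        return 0;
    }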

[Figure: generated communication patterns — MPI_Scatter, ALLTOALL, BCAST.]

Other Communication Patterns

Generating Code for IB/verbs and YH

Communication Layer
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- Discontiguous RDMA communication pattern achieving zero-copy
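As an illustration of the discontiguous zero-copy idea (not the actual patched communication layer, which is considerably more involved): with IB verbs, several non-contiguous local regions of a registered buffer can be gathered into a single RDMA write by listing them as scatter/gather entries, so no intermediate packing copy is needed. The function below is a hedged sketch; queue-pair setup, memory registration, and completion handling are assumed to happen elsewhere, and all names are hypothetical.

    /* Sketch: gather `n` discontiguous local regions (all inside the
     * registered region `mr`) into one RDMA write to a contiguous remote
     * buffer, avoiding any local packing copy.  Illustrative only. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int rdma_write_gather(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *local[], uint32_t len[], int n,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge[16];            /* n must not exceed the QP's max_send_sge */
        for (int i = 0; i < n && i < 16; i++) {
            sge[i].addr   = (uintptr_t)local[i];
            sge[i].length = len[i];
            sge[i].lkey   = mr->lkey;
        }

        struct ibv_send_wr wr = {0}, *bad_wr = NULL;
        wr.wr_id               = 1;
        wr.sg_list             = sge;
        wr.num_sge             = n;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided, zero-copy write */
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }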

Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem
(Before Nov 2011)

Direct Simulation of Turbulent Flows

Scale
- Up to 14336³ (3-D, single precision)
- 12 distributed arrays, each with 11 TB of data (128 TB in total)
- Entire Tianhe-1A with 7168 nodes

Progress
- 4096³ completed; 8192³ half-way; 14336³ tested for performance

Software Technologies
- PARRAY code only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.

Generated Code

Discussions

Other programming models?
- MPI (more expressive datatype)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating Zero-Copy IB calls)

We need a software stack!
- Irregular structures must be encoded into arrays and then benefit from PARRAY.
- Runtime workflow possible above PARRAY.
- Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros.
- Macros are compiled out: no performance loss.
- Typical training = 3 days, friendly to engineers…
