TRANSCRIPT
Slide 1
Using OpenACC in IFS Physics' Cloud Scheme (CLOUDSC)
Sami Saarinen
ECMWF
Basic GPU Training
Sept 16-17, 2015
Slide 2
Background
Back in 2014 : Adaptation of IFS physics’ cloud scheme (CLOUDSC) to new architectures as part of ECMWF Scalability programme
Emphasis was on GPU-migration by use of OpenACC directives
CLOUDSC consumes about 10% of IFS Forecast time
Some 3500 lines of Fortran 2003 – before OpenACC directives
This presentation concentrates on comparing performance of:
Haswell – OpenMP version of CLOUDSC
NVIDIA GPU (K40) – OpenACC version of CLOUDSC
Slide 3
Some earlier results
Baseline results down from 40s to 0.24s on K40 GPU
PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~3Q/2014)
Also Cray CCE 8.4 OpenACC-compiler was tried
OpenACC directives inserted automatically by use of the acc_insert Perl script, followed by manual clean-up
Source code lines expanded from 3500 to 5000 in CLOUDSC!
The code with OpenACC directives still sustains roughly the same performance as before on the Intel Xeon host side
The GPU's computational performance was the same as or better than the Intel Haswell's (36-core model, 2.3GHz)
Data transfer added serious overheads
Strange DATA PRESENT testing & memory pinning slowdowns
Slide 4
The problem setup for this case study
Given 160,000 grid point columns (NGPTOT)
Each with 137 levels (NLEV)
About 80,000 columns fit into one K40 GPU
Grid point columns are independent of each other
So no horizontal dependencies here, but ...
... level dependency prevents parallelization along the vertical dimension
Arrays are organized in blocks of grid point columns
Instead of using ARRAY(NGPTOT, NLEV) ...
... we use ARRAY(NPROMA, NLEV, NBLKS)
NPROMA is a blocking factor fixed at runtime
Arrays are OpenMP thread safe over NBLKS
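As a minimal sketch (constants and variable names assumed for illustration, not taken from the IFS code), the blocked layout and the mapping from a global column index to a (column, block) pair look like this:
  INTEGER, PARAMETER :: NGPTOT = 160000, NLEV = 137, NPROMA = 80
  INTEGER, PARAMETER :: NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA   ! Number of blocks
  REAL(KIND=8) :: ARRAY(NPROMA, NLEV, NGPBLKS)
  INTEGER :: JG, IBL, JL
  JG  = 12345                       ! Some global grid point column (hypothetical)
  IBL = (JG - 1) / NPROMA + 1       ! Block holding this column
  JL  = JG - (IBL - 1) * NPROMA     ! Column's position inside the block
  ! Column JG at level JK is then ARRAY(JL, JK, IBL)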
Slide 5
Hardware, compiler & NPROMAs used
Haswell node: 24 cores @ 2.5GHz
2 x NVIDIA K40c GPUs on each Haswell node via PCIe
Each GPU equipped with 12GB memory – with CUDA 7.0
PGI Compiler 15.7 with OpenMP & OpenACC
-O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
Environment variables: PGI_ACC_NOSHARED=1, PGI_ACC_BUFFERSIZE=4M
Typical good NPROMA value for Haswell ~ 10 – 100
Per GPU, NPROMA up to 80,000 for max performance
Slide 6
Haswell : Driving CLOUDSC with OpenMP
REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA             ! So called NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1             ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)   ! Block length <= NPROMA
  CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
   &  array(1,1,IBL), &              ! ~ 65 arrays like this
   &  ... )
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in OpenMP implementation: 10 – 100
Slide 7
OpenMP scaling (Haswell, in GFlops/s)
[Chart: GFlops/s vs. number of OpenMP threads (1, 2, 4, 6, 12, 24); two series: NPROMA 10 and NPROMA 100]
Slide 8
Development of OpenACC/GPU-version
The driver code with the OpenMP loop was kept roughly unchanged
GPU to HOST data mapping (ACC DATA) added
Note that OpenACC can (in most cases) co-exist with OpenMP
This allows an elegant multi-GPU implementation
CLOUDSC was pre-processed with the "acc_insert" Perl script
This allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
In addition, some minimal manual source code clean-up was done
CLOUDSC performance on the GPU needs a very large NPROMA
Lack of multilevel parallelism (only across NPROMA, not NLEV)
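As a rough illustration (array names hypothetical, not the actual CLOUDSC arguments), the directives produced by acc_insert follow this kind of pattern inside CLOUDSC:
  !$ACC DATA PRESENT(PINPUT, POUTPUT) CREATE(ZWORK)   ! Arguments assumed already on the GPU; temporaries created there
  !$ACC KERNELS LOOP
  DO JL=KIDIA,KFDIA
    ZWORK(JL)   = PINPUT(JL,1)                        ! Illustrative computation only
    POUTPUT(JL) = 2.0_JPRB * ZWORK(JL)
  END DO
  !$ACC END KERNELS
  !$ACC END DATA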
Slide 9
Driving OpenACC CLOUDSC with OpenMP
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs)
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in OpenACC implementation: > 10,000
Slide 10
Sample OpenACC coding of CLOUDSC
!$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP)
DO JK=1,KLEV
  DO JL=KIDIA,KFDIA
    ztmp_q = 0.0_JPRB
    ztmp = 0.0_JPRB
    !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q) REDUCTION(+:ZTMP)
    DO JM=1,NCLV-1
      IF (ZQX(JL,JK,JM)<RLMIN) THEN
        ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
        ZQADJ = ZQX(JL,JK,JM)*ZQTMST
        ztmp_q = ztmp_q + ZQADJ
        ztmp = ztmp + ZQX(JL,JK,JM)
        ZQX(JL,JK,JM) = 0.0_JPRB
      ENDIF
    ENDDO
    PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
    ZQX(JL,JK,NCLDQV) = ZQX(JL,JK,NCLDQV) + ztmp
  ENDDO
ENDDO
!$ACC END KERNELS ASYNC(IBL)
ASYNC removes CUDA-thread syncs
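A hedged sketch of how such an async queue is used and later synchronized (queue number taken to be the block index IBL as above, loop body illustrative):
  !$ACC KERNELS ASYNC(IBL)          ! Queue the kernel on stream IBL; host does not block here
  DO JL=KIDIA,KFDIA
    ZA(JL) = ZB(JL) + ZC(JL)        ! Illustrative computation only
  END DO
  !$ACC END KERNELS
  !$ACC WAIT(IBL)                   ! Synchronize only when block IBL's results are actually needed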
Slide 11
OpenACC scaling (K40c, in GFlops/s)
[Chart: GFlops/s vs. NPROMA (100, 1,000, 10,000, 20,000, 40,000, 80,000); two series: 1 GPU and 2 GPUs]
Slide 12
Timing (ms) breakdown: single GPU
[Chart: stacked timing bars (Computation, Communication, Other overhead) vs. NPROMA (10, 1,000, 20,000, 80,000), with the Haswell time shown for reference]
Slide 13
Saturating GPUs with more work
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)   ! More threads here
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL
Slide 14
Saturating GPUs with more work
Consider a few performance-degrading factors at present:
Parallelism only in the NPROMA dimension in CLOUDSC
Updating 60-odd arrays back and forth every time step
OpenACC overhead related to data transfers & ACC DATA
Can we do better? YES! We can enable concurrently executed kernels through OpenMP!
• Time-sharing GPU(s) across multiple OpenMP threads
About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
Extra care must be taken to avoid running out of memory on the GPU
• Needs ~4X smaller NPROMA: 20,000 instead of 80,000
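A back-of-the-envelope sizing of that 4X reduction, assuming the earlier figure of roughly 80,000 columns fitting on one 12GB K40 (constant names are illustrative only):
  INTEGER, PARAMETER :: NCOLS_PER_GPU    = 80000   ! Columns that fit on one K40 (earlier measurement)
  INTEGER, PARAMETER :: NTHREADS_PER_GPU = 4       ! Host threads time-sharing each GPU
  INTEGER, PARAMETER :: NPROMA = NCOLS_PER_GPU / NTHREADS_PER_GPU   ! = 20,000 columns per block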
Slide 15
Multiple copies of CLOUDSC per GPU (GFlops/s)
[Chart: GFlops/s vs. number of CLOUDSC copies per GPU (1, 2, 4); two series: 1 GPU and 2 GPUs]
Slide 16
nvvp profiler shows time-sharing impact
[nvvp timeline: top profile shows the GPU 4-way time-shared; bottom profile shows the GPU fed with work by only one OpenMP thread]
Slide 17
Timing (ms): 4-way time-shared vs. no T/S
[Chart: stacked timing bars (Computation, Communication, Other overhead) vs. NPROMA (10, 20,000, 80,000) with the Haswell time for reference, comparing the GPU 4-way time-shared against the GPU not time-shared]
Slide 18
24-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
[Chart: GFlops/s for Haswell, 1 GPU, 1 GPU (T/S), 2 GPUs, and 2 GPUs (T/S); T/S = GPUs time-shared]
Slide 19
Conclusions
CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF’s tiny GPU cluster in 3Q/2015
Since last time, the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
Not surprisingly, saturation of the GPUs can be achieved with the help of the multi-core host launching more data blocks onto the GPUs
The outcome is not bad, considering that we seem to be underutilizing the GPUs (parallelism only along NPROMA)