TRANSCRIPT
Slide 1
Using OpenACC in IFS Physics' Cloud Scheme (CLOUDSC)
Sami Saarinen
ECMWF
Basic GPU Training
Sept 16-17, 2015
Slide 2
Background
Back in 2014 : Adaptation of IFS physics’ cloud scheme (CLOUDSC) to new architectures as part of ECMWF Scalability programme
Emphasis was on GPU-migration by use of OpenACC directives
CLOUDSC consumes about 10% of IFS Forecast time
Some 3500 lines of Fortran 2003 – before OpenACC directives
This presentation concentrates on comparing performance of:
Haswell – OpenMP version of CLOUDSC
NVIDIA GPU (K40) – OpenACC version of CLOUDSC
Slide 3
Some earlier results
Baseline results down from 40s to 0.24s on K40 GPU
PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~3Q/2014)
Also Cray CCE 8.4 OpenACC-compiler was tried
OpenACC directives inserted automatically by use of the acc_insert Perl script, followed by manual clean-up
Source code lines expanded from 3500 to 5000 in CLOUDSC!
The code with OpenACC directives still sustains roughly the same performance as before on the Intel Xeon host side
The GPU's computational performance was the same as or better than the Intel Haswell's (36-core model, 2.3GHz)
Data transfer added serious overheads
Strange DATA PRESENT testing & memory pinning slowdowns
Slide 4
The problem setup for this case study
Given 160,000 grid point columns (NGPTOT)
Each with 137 levels (NLEV)
About 80,000 columns fit into one K40 GPU
Grid point columns are independent of each other
So no horizontal dependencies here, but ...
... level dependency prevents parallelization along the vertical dimension
Arrays are organized in blocks of grid point columns
Instead of using ARRAY(NGPTOT, NLEV) ...
... we use ARRAY(NPROMA, NLEV, NBLKS)
NPROMA is a blocking factor fixed at runtime
Arrays are OpenMP thread safe over NBLKS
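As a minimal sketch (constants and variable names assumed for illustration, not taken from the IFS code), the blocked layout and the mapping from a global column index to a (column, block) pair look like this:
  INTEGER, PARAMETER :: NGPTOT = 160000, NLEV = 137, NPROMA = 80
  INTEGER, PARAMETER :: NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA   ! Number of blocks
  REAL(KIND=8) :: ARRAY(NPROMA, NLEV, NGPBLKS)
  INTEGER :: JG, IBL, JL
  JG  = 12345                       ! Some global grid point column (hypothetical)
  IBL = (JG - 1) / NPROMA + 1       ! Block holding this column
  JL  = JG - (IBL - 1) * NPROMA     ! Column's position inside the block
  ! Column JG at level JK is then ARRAY(JL, JK, IBL)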
Slide 5
Hardware, compiler & NPROMAs used
Haswell node: 24 cores @ 2.5GHz
2 x NVIDIA K40c GPUs on each Haswell node via PCIe
Each GPU equipped with 12GB memory – with CUDA 7.0
PGI Compiler 15.7 with OpenMP & OpenACC
-O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
Environment variables: PGI_ACC_NOSHARED=1, PGI_ACC_BUFFERSIZE=4M
Typical good NPROMA value for Haswell ~ 10 – 100
Per GPU, NPROMA up to 80,000 for max performance
Slide 6
Haswell : Driving CLOUDSC with OpenMP
REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA             ! So called NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1             ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)   ! Block length <= NPROMA
  CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
   &  array(1,1,IBL), &              ! ~ 65 arrays like this
   &  ... )
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in OpenMP implementation: 10 – 100
Slide 7
OpenMP scaling (Haswell, in GFlops/s)
[Chart: GFlops/s vs. number of OpenMP threads (1, 2, 4, 6, 12, 24); two series: NPROMA 10 and NPROMA 100]
Slide 8
Development of OpenACC/GPU-version
The driver code with the OpenMP loop was kept roughly unchanged
GPU to HOST data mapping (ACC DATA) added
Note that OpenACC can (in most cases) co-exist with OpenMP
This allows an elegant multi-GPU implementation
CLOUDSC was pre-processed with the "acc_insert" Perl script
This allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
In addition, some minimal manual source code clean-up was done
CLOUDSC performance on the GPU needs a very large NPROMA
Lack of multilevel parallelism (only across NPROMA, not NLEV)
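As a rough illustration (array names hypothetical, not the actual CLOUDSC arguments), the directives produced by acc_insert follow this kind of pattern inside CLOUDSC:
  !$ACC DATA PRESENT(PINPUT, POUTPUT) CREATE(ZWORK)   ! Arguments assumed already on the GPU; temporaries created there
  !$ACC KERNELS LOOP
  DO JL=KIDIA,KFDIA
    ZWORK(JL)   = PINPUT(JL,1)                        ! Illustrative computation only
    POUTPUT(JL) = 2.0_JPRB * ZWORK(JL)
  END DO
  !$ACC END KERNELS
  !$ACC END DATA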
Slide 9
Driving OpenACC CLOUDSC with OpenMP
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs)
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in OpenACC implementation: > 10,000
Slide 10
Sample OpenACC coding of CLOUDSC
!$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP)
DO JK=1,KLEV
  DO JL=KIDIA,KFDIA
    ztmp_q = 0.0_JPRB
    ztmp = 0.0_JPRB
    !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q) REDUCTION(+:ZTMP)
    DO JM=1,NCLV-1
      IF (ZQX(JL,JK,JM)<RLMIN) THEN
        ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
        ZQADJ = ZQX(JL,JK,JM)*ZQTMST
        ztmp_q = ztmp_q + ZQADJ
        ztmp = ztmp + ZQX(JL,JK,JM)
        ZQX(JL,JK,JM) = 0.0_JPRB
      ENDIF
    ENDDO
    PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
    ZQX(JL,JK,NCLDQV) = ZQX(JL,JK,NCLDQV) + ztmp
  ENDDO
ENDDO
!$ACC END KERNELS ASYNC(IBL)
ASYNC removes CUDA-thread syncs
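A hedged sketch of how such an async queue is used and later synchronized (queue number taken to be the block index IBL as above, loop body illustrative):
  !$ACC KERNELS ASYNC(IBL)          ! Queue the kernel on stream IBL; host does not block here
  DO JL=KIDIA,KFDIA
    ZA(JL) = ZB(JL) + ZC(JL)        ! Illustrative computation only
  END DO
  !$ACC END KERNELS
  !$ACC WAIT(IBL)                   ! Synchronize only when block IBL's results are actually needed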
Slide 11
OpenACC scaling (K40c, in GFlops/s)
[Chart: GFlops/s vs. NPROMA (100, 1,000, 10,000, 20,000, 40,000, 80,000); two series: 1 GPU and 2 GPUs]
Slide 12
Timing (ms) breakdown: single GPU
[Chart: stacked timing bars (Computation, Communication, Other overhead) vs. NPROMA (10, 1,000, 20,000, 80,000), with the Haswell time shown for reference]
Slide 13
Saturating GPUs with more work
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)   ! More threads here
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL
Slide 14
Saturating GPUs with more work
Consider a few performance-degrading factors at present:
Parallelism only in the NPROMA dimension in CLOUDSC
Updating 60-odd arrays back and forth every time step
OpenACC overhead related to data transfers & ACC DATA
Can we do better? YES! We can enable concurrently executed kernels through OpenMP!
• Time-sharing GPU(s) across multiple OpenMP threads
About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
Extra care must be taken to avoid running out of memory on the GPU
• Needs ~4X smaller NPROMA: 20,000 instead of 80,000
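A back-of-the-envelope sizing of that 4X reduction, assuming the earlier figure of roughly 80,000 columns fitting on one 12GB K40 (constant names are illustrative only):
  INTEGER, PARAMETER :: NCOLS_PER_GPU    = 80000   ! Columns that fit on one K40 (earlier measurement)
  INTEGER, PARAMETER :: NTHREADS_PER_GPU = 4       ! Host threads time-sharing each GPU
  INTEGER, PARAMETER :: NPROMA = NCOLS_PER_GPU / NTHREADS_PER_GPU   ! = 20,000 columns per block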
Slide 15
Multiple copies of CLOUDSC per GPU (GFlops/s)
[Chart: GFlops/s vs. number of CLOUDSC copies per GPU (1, 2, 4); two series: 1 GPU and 2 GPUs]
Slide 16
nvvp profiler shows time-sharing impact
[nvvp timeline: top profile shows the GPU 4-way time-shared; bottom profile shows the GPU fed with work by only one OpenMP thread]
Slide 17
Timing (ms): 4-way time-shared vs. no T/S
[Chart: stacked timing bars (Computation, Communication, Other overhead) vs. NPROMA (10, 20,000, 80,000) with the Haswell time for reference, comparing the GPU 4-way time-shared against the GPU not time-shared]
Slide 18
24-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
[Chart: GFlops/s for Haswell, 1 GPU, 1 GPU (T/S), 2 GPUs, and 2 GPUs (T/S); T/S = GPUs time-shared]
Slide 19
Conclusions
CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF’s tiny GPU cluster in 3Q/2015
Since last time, the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
Not surprisingly, saturation of the GPUs can be achieved with the help of the multi-core host launching more data blocks onto the GPUs
The outcome is not bad, considering that we seem to be underutilizing the GPUs (parallelism only along NPROMA)