The Parallel Programming Osmotic Membrane
Prof. Jesús Labarta, BSC & UPC
44ème Forum ORAP
Paris, Nov. 29th 2019
Code lifetime cycle
Need → Interest → Idea/model → Code
[Diagram: the code forks into many divergent versions as "platform/language specificities", "optimizations" and "hardwired order/schedules" creep in.]
There is HOPE!!!
Need → Interest → Idea/model → Code
[Diagram: a single code serves all platforms.]
Performance portability. Maintainable, adaptable. Focus on logic.
The PM osmotic membrane
[Diagram: applications on one side, ISA/API on the other; the programming model (PM) is the membrane between them.]
• Power to the runtime
• PM: high-level, clean, abstract interface
  • General purpose
  • Task & data based
  • Forget about resources
  • Decouple: minimal & sufficient
• Permeability?
• Intelligence & resource management
• "Reuse & expand" old architectural ideas under new constraints
[Other diagram labels: Visualization, Storage]
Integrate concurrency and data: a single mechanism
• Concurrency: dependences built from data accesses; lookahead is about instantiating work
• Locality & data management: derived from the same data accesses
OmpSs: a forerunner for OpenMP
+ Task prototyping
+ Task dependences
+ Task priorities
+ Taskloop prototyping
+ Task reductions
+ Taskwait dependences
+ OMPT impl.
+ Multideps
+ Commutative
+ Taskloop dependences
+ Data affinity
Today
The real revolution
• Hybrid task based: ~OK, but…
• A "proper" model does not guarantee "proper" programs
  • Flexibility → can be used "wrong"
  • Legacy features cast a strong "shadow"
• The revolution is in the mindset of programmers
  • "Forget" about hardware and resources
  • Focus on program logic
  • Methodology
• Top-down programming methodology: every level contributes
• Throughput oriented: try not to stall!
• Think global: may be imprecise
• Specify local: precise
To exascale… and beyond
And it IS disruptive!!!
Contact: https://www.pop-coe.eu  mailto:[email protected]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 676553.
Performance Optimisation and Productivity: A Centre of Excellence in Computing Applications
Some recommendations
"Recommendations"
• Don't mask your symptoms… try to understand
• Taskify with dependences… do not stall!!!!
  • Computation
  • Communications
• Do nest… everybody's contribution counts
  • Top down
  • Precise region specification
• Do hint… do not force!!!
• Think malleable… be water, my friend
• Homogenize heterogeneity
• Beware of reductions on large arrays
• …
… all of this can be achieved in an incremental way
Don't mask your symptoms…
[POP efficiency model:]
• Global Efficiency
  • Parallel Efficiency
    • Load Balance
    • Communication Efficiency
      • Serialization Efficiency
      • Transfer Efficiency
  • Computation Efficiency
    • IPC scaling Efficiency
    • Instruction scaling Efficiency
    • Frequency Efficiency
[Underlying factors: sharing effects (cache capacity, memory bandwidth, power); memory BW; effective cache capacity; code replication; dependencies; OS noise; instruction mix; locality differences across threads; runtime overhead; instruction balance; code patterns; kernel scheduling; NIC; usage practices.]
https://pop-coe.eu
…try to understand
Alya, FullPlane @ 1K cores
[Traces: IPC solver; IPC assembly; IPC imbalance @ solver, MPICH; Isend duration @ solver, OpenMPI: 700-900 µs]
…try to understand
IFS-SP 420x4
[Traces: serialized unpacking; very fine grain parallelization of individual unpacks; serialized communication pattern]
Don't mask your symptoms…
Need → Interest → Idea/model → Code
[Diagram: the code forks again under "platform/language specificities", "optimizations" and "hardwired order/schedules".]
Where is the undo button??
Taskify computation
• Four loops/routines
  • Sequential program order
  • Fork-join parallelism: not parallelizing one loop hurts
  • Task-based, top-down parallelism: not parallelizing one loop is still fine
GROMACS
"A chance for lazy programmers"
Taskify communication
• MPI + OpenMP
• Non-blocking MPI + OpenMP
• MPI + OmpSs
  • Top down
  • Overlap communications
  • Serial FFTs
  • Replicate communicators
[Diagram labels: Parallel Functions; Task Dependencies]
FFT lib, Quantum ESPRESSO miniapp
17
Taskify communications
NTChem

  1: RIMP2_RMP2Energy_InCore_V_MPIOMP()
     …
405: DO LNumber_Base …
498:   DGEMM
518:   if (something) {
         wait           // for current iter.
         Isend, Irecv   // for next iter.
       }
       allreduce
588:   DO loops         // evaluating MP2 correlation
636: END DO … END DO

• Don't mask your symptoms
• Top down
• Leave order (overlap) to OpenMP
for (int R = 1; R < rows-1; R += bs) {
  #pragma oss task weakin(matrix[R-1][1;cols-2],              \
                          matrix[R+bs][1;cols-2])             \
                   weakinout(matrix[R;bs][1;cols-2])
  for (int C = 1; C < cols-1; C += bs) {
    #pragma oss task in(matrix[R-1][C;bs], matrix[R;bs][C-1],   \
                        matrix[R+bs][C;bs], matrix[R;bs][C+bs]) \
                     inout(matrix[R;bs][C;bs])
    for (int r = R; (r < rows-1) && (r < R+bs); r++)
      for (int c = C; (c < cols-1) && (c < C+bs); c++)
        /* 5-point stencil: average of the four neighbours */
        matrix[r][c] = 0.25 * (matrix[r-1][c] + matrix[r+1][c] +
                               matrix[r][c-1] + matrix[r][c+1]);
  }
}
Do nest
• Nest tasks, not parallels
• Dynamic overdecomposition
DMRG++
J. Criado et al., "Optimization of Condensed Matter Physics Application with OpenMP Tasking Model", IWOMP 2019
Do nest
for (d = 0; d < dt; d++) {
  for (c = 0; c < ct; c++) {
    #pragma oss task …
    dtrsm( … );
  }
  decide(final_1);
  for (c = 0; c < ct; c++) {
    #pragma oss task weakinout(TILEB[d:rt-1][c]) … final(final_1)
    for (r = d; r < rt; r++) {
      decide(final_2);
      #pragma oss task inout(TILEB[r][c]) … final(final_2)
      dgemm( … );
    }
  }
}
[Traces: Lamps OpenMP vs. Lamps OmpSs]
Do hint
• Priorities
• Anti-dependence distances
J. Criado et al., "Optimization of Condensed Matter Physics Application with OpenMP Tasking Model", IWOMP 2019
Homogenize heterogeneity
• Performance heterogeneity
• ISA heterogeneity
• Several non-coherent address spaces
Think malleable
• Dynamic load balance & resource management
  • Intra/inter process/application
• Library (DLB)
  • Runtime interception (MPIP, OMPT, …)
  • API to hint resource demands
  • Core reallocation policy
• Opportunity to fight Amdahl's law
• Productive/easy!!!
  • N×1
  • Hybridize imbalanced regions
"LeWI: A Runtime Balancing Algorithm for Nested Parallelism", M. Garcia et al., ICPP 2009
"Hints to improve automatic load balancing with LeWI for hybrid applications", JPDC 2014
Beware of reductions on large arrays
• Reductions on large arrays with indirection
  • Atomic
  • Array privatization + final reduction
  • Coloring
  • Serialize
• Commutative clause: specify incompatibilities!!!
• Commutative + multidependences
At all levels
StarSs: at all levels
• Core / Parallel / Ensemble, workflow
• Average task granularity: nanoseconds, microseconds, milliseconds, seconds, hours, days
• Language binding: C pragmas/intrinsics; C, C++, FORTRAN; Java, Python
• Address space to compute dependences: registers; memory; objects, files
• OmpSs: @ SMP, @ GPU, @ Cluster, @ FPGA
StarSs: at all levels
• Core / Parallel / Ensemble, workflow
• Average task granularity: nanoseconds, microseconds, milliseconds, seconds, hours, days
• Language binding: C pragmas/intrinsics; C, C++, FORTRAN; Java, Python
• Address space to compute dependences: registers; memory; objects, files
• OmpSs: @ SMP, @ GPU, @ Cluster, @ FPGA
• COMPSs/PyCOMPSs: @ Cloud, @ Cluster, @ Multicore
• (long) vectors: @ vector ISA
PyCOMPSs

Main program:
    num_frags = …
    todo_list = init_list()
    for i in range(num_frags):
        result = cc_sur(todo_list[i])
        gather(result, global_result)
    compss_stop()

Tasks definition:
    @task(returns=list, result=IN, global_result=INOUT)
    def gather(result, global_result):
        …

    @task(returns=list)
    def cc_sur(frag):
        …
RISC-V VECTOR COMPILER
• Programming → RISC-V vector extension ISA
  • Intrinsics (https://repo.hca.bsc.es/gitlab/rferrer/epi-builtins-ref/blob/master/README.md#documentation)
  • #pragma omp simd
  • Automatic parallelization
• Compiler
  • LLVM based
• Compiler Explorer (http://repo.hca.bsc.es/epic)
  • Interactive impact of source code modifications → generated code
  • A few example RISC-V vector codes using intrinsics
RISC-V VECTOR EMULATOR
• Emulation of the RISC-V vector ISA
  • Parametrized MAXVL
• Simple timing models
  • Memory architecture
  • Instruction timing
• Very detailed analytics in Paraver → co-design
  • Vector length
  • Register use
  • Memory addresses, cache ratios
  • Instruction timing
  • …
[Toolchain diagram: LLVM → emulation environment (emulation library, timing model) → trace2prv → .prv → Paraver; MUSA]

for (int kk = 0; kk < n; kk += bk) {
  vb0 = __builtin_epi_vload_f64(&b[kk  ][jj]);
  vb1 = __builtin_epi_vload_f64(&b[kk+1][jj]);
  vb2 = __builtin_epi_vload_f64(&b[kk+2][jj]);
  vb3 = __builtin_epi_vload_f64(&b[kk+3][jj]);
  {
    __epi_f64 vta0, vta1, vta2, vta3;
    __epi_f64 vtp0, vtp1, vtp2, vtp3;
    vta0 = __builtin_epi_vbroadcast_f64(a[ii][kk]);
    vtp0 = __builtin_epi_vfmul_f64(vta0, vb0);
    vc0  = __builtin_epi_vfadd_f64(vc0, vtp0);
    …
RISC-V VECTOR VISUALIZATION
• FE mockup
  • Program counter
  • Memory address
  • Memory instruction cost
Copyright © European Processor Initiative 2019. EPI SC19/Denver/Nov 2019
Final thoughts…
Age before beauty
• Behavior (insight/models) before syntax
• Detailed performance analytics before aggregated profiles
• Work instantiation and order before overhead
• Malleability before a fitted rigid structure
• Possibilities before how-tos
• Elegance before a one-day shine
• It is all about the programmer mindset!!!
Grandpa Cebolleta strikes again ("el abuelo Cebolleta ataca de nuevo")
Thanks!