The Parallel Programming Osmotic Membrane
Prof. Jesús Labarta, BSC & UPC
44ème Forum ORAP
Paris, Nov. 29th 2019
Code lifetime cycle
Need → Interest → Idea/model → Code
[Diagram: the code forks into many divergent versions as "platform/language specificities", "optimizations" and "hardwired order/schedules" creep in.]
There is HOPE!!!
Need → Interest → Idea/model → Code
[Diagram: a single code serves all platforms.]
Performance portability. Maintainable, adaptable. Focus on logic.
The PM osmotic membrane
[Diagram: applications on one side, ISA/API on the other; the programming model (PM) is the membrane between them.]
• Power to the runtime
• PM: high-level, clean, abstract interface
  • General purpose
  • Task & data based
  • Forget about resources
  • Decouple: minimal & sufficient
• Permeability?
• Intelligence & resource management
• "Reuse & expand" old architectural ideas under new constraints
[Other diagram labels: Visualization, Storage]
Integrate concurrency and data: a single mechanism
• Concurrency: dependences built from data accesses; lookahead is about instantiating work
• Locality & data management: derived from the same data accesses
OmpSs: a forerunner for OpenMP
+ Task prototyping
+ Task dependences
+ Task priorities
+ Taskloop prototyping
+ Task reductions
+ Taskwait dependences
+ OMPT impl.
+ Multideps
+ Commutative
+ Taskloop dependences
+ Data affinity
Today
The real revolution
• Hybrid task based: ~OK, but…
• A "proper" model does not guarantee "proper" programs
  • Flexibility → can be used "wrong"
  • Legacy features cast a strong "shadow"
• The revolution is in the mindset of programmers
  • "Forget" about hardware and resources
  • Focus on program logic
  • Methodology
• Top-down programming methodology: every level contributes
• Throughput oriented: try not to stall!
• Think global: may be imprecise
• Specify local: precise
To exascale… and beyond
And it IS disruptive!!!
Contact: https://www.pop-coe.eu  mailto:[email protected]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 676553.
Performance Optimisation and Productivity: A Centre of Excellence in Computing Applications
Some recommendations
"Recommendations"
• Don't mask your symptoms… try to understand
• Taskify with dependences… do not stall!!!!
  • Computation
  • Communications
• Do nest… everybody's contribution counts
  • Top down
  • Precise region specification
• Do hint… do not force!!!
• Think malleable… be water, my friend
• Homogenize heterogeneity
• Beware of reductions on large arrays
• …
… all of this can be achieved in an incremental way
Don't mask your symptoms…
[POP efficiency model:]
• Global Efficiency
  • Parallel Efficiency
    • Load Balance
    • Communication Efficiency
      • Serialization Efficiency
      • Transfer Efficiency
  • Computation Efficiency
    • IPC scaling Efficiency
    • Instruction scaling Efficiency
    • Frequency Efficiency
[Underlying factors: sharing effects (cache capacity, memory bandwidth, power); memory BW; effective cache capacity; code replication; dependencies; OS noise; instruction mix; locality differences across threads; runtime overhead; instruction balance; code patterns; kernel scheduling; NIC; usage practices.]
https://pop-coe.eu
…try to understand
Alya, FullPlane @ 1K cores
[Traces: IPC solver; IPC assembly; IPC imbalance @ solver, MPICH; Isend duration @ solver, OpenMPI: 700-900 µs]
…try to understand
IFS-SP 420x4
[Traces: serialized unpacking; very fine grain parallelization of individual unpacks; serialized communication pattern]
Don't mask your symptoms…
Need → Interest → Idea/model → Code
[Diagram: the code forks again under "platform/language specificities", "optimizations" and "hardwired order/schedules".]
Where is the undo button??
Taskify computation
• Four loops/routines
  • Sequential program order
  • Fork-join parallelism: not parallelizing one loop hurts
  • Task-based, top-down parallelism: not parallelizing one loop is still fine
GROMACS
"A chance for lazy programmers"
Taskify communication
• MPI + OpenMP
• Non-blocking MPI + OpenMP
• MPI + OmpSs
  • Top down
  • Overlap communications
  • Serial FFTs
  • Replicate communicators
[Diagram labels: Parallel Functions; Task Dependencies]
FFT lib, Quantum ESPRESSO miniapp
17
Taskify communications
NTChem

  1: RIMP2_RMP2Energy_InCore_V_MPIOMP()
     …
405: DO LNumber_Base …
498:   DGEMM
518:   if (something) {
         wait           // for current iter.
         Isend, Irecv   // for next iter.
       }
       allreduce
588:   DO loops         // evaluating MP2 correlation
636: END DO … END DO

• Don't mask your symptoms
• Top down
• Leave order (overlap) to OpenMP
for (int R = 1; R < rows-1; R += bs) {
  #pragma oss task weakin(matrix[R-1][1;cols-2],              \
                          matrix[R+bs][1;cols-2])             \
                   weakinout(matrix[R;bs][1;cols-2])
  for (int C = 1; C < cols-1; C += bs) {
    #pragma oss task in(matrix[R-1][C;bs], matrix[R;bs][C-1],   \
                        matrix[R+bs][C;bs], matrix[R;bs][C+bs]) \
                     inout(matrix[R;bs][C;bs])
    for (int r = R; (r < rows-1) && (r < R+bs); r++)
      for (int c = C; (c < cols-1) && (c < C+bs); c++)
        /* 5-point stencil: average of the four neighbours */
        matrix[r][c] = 0.25 * (matrix[r-1][c] + matrix[r+1][c] +
                               matrix[r][c-1] + matrix[r][c+1]);
  }
}
Do nest
• Nest tasks, not parallels
• Dynamic overdecomposition
DMRG++
J. Criado et al., "Optimization of Condensed Matter Physics Application with OpenMP Tasking Model", IWOMP 2019
Do nest
for (d = 0; d < dt; d++) {
  for (c = 0; c < ct; c++) {
    #pragma oss task …
    dtrsm( … );
  }
  decide(final_1);
  for (c = 0; c < ct; c++) {
    #pragma oss task weakinout(TILEB[d:rt-1][c]) … final(final_1)
    for (r = d; r < rt; r++) {
      decide(final_2);
      #pragma oss task inout(TILEB[r][c]) … final(final_2)
      dgemm( … );
    }
  }
}
[Traces: Lamps OpenMP vs. Lamps OmpSs]
Do hint
• Priorities
• Anti-dependence distances
J. Criado et al., "Optimization of Condensed Matter Physics Application with OpenMP Tasking Model", IWOMP 2019
Homogenize heterogeneity
• Performance heterogeneity
• ISA heterogeneity
• Several non-coherent address spaces
Think malleable
• Dynamic load balance & resource management
  • Intra/inter process/application
• Library (DLB)
  • Runtime interception (MPIP, OMPT, …)
  • API to hint resource demands
  • Core reallocation policy
• Opportunity to fight Amdahl's law
• Productive/easy!!!
  • N×1
  • Hybridize imbalanced regions
"LeWI: A Runtime Balancing Algorithm for Nested Parallelism", M. Garcia et al., ICPP 2009
"Hints to improve automatic load balancing with LeWI for hybrid applications", JPDC 2014
Beware of reductions on large arrays
• Reductions on large arrays with indirection
  • Atomic
  • Array privatization + final reduction
  • Coloring
  • Serialize
• Commutative clause: specify incompatibilities!!!
• Commutative + multidependences
At all levels
StarSs: at all levels
• Core / Parallel / Ensemble, workflow
• Average task granularity: nanoseconds, microseconds, milliseconds, seconds, hours, days
• Language binding: C pragmas/intrinsics; C, C++, FORTRAN; Java, Python
• Address space to compute dependences: registers; memory; objects, files
• OmpSs: @ SMP, @ GPU, @ Cluster, @ FPGA
StarSs: at all levels
• Core / Parallel / Ensemble, workflow
• Average task granularity: nanoseconds, microseconds, milliseconds, seconds, hours, days
• Language binding: C pragmas/intrinsics; C, C++, FORTRAN; Java, Python
• Address space to compute dependences: registers; memory; objects, files
• OmpSs: @ SMP, @ GPU, @ Cluster, @ FPGA
• COMPSs/PyCOMPSs: @ Cloud, @ Cluster, @ Multicore
• (long) vectors: @ vector ISA
PyCOMPSs

Main program:
    num_frags = …
    todo_list = init_list()
    for i in range(num_frags):
        result = cc_sur(todo_list[i])
        gather(result, global_result)
    compss_stop()

Tasks definition:
    @task(returns=list, result=IN, global_result=INOUT)
    def gather(result, global_result):
        …

    @task(returns=list)
    def cc_sur(frag):
        …
RISC-V VECTOR COMPILER
• Programming → RISC-V vector extension ISA
  • Intrinsics (https://repo.hca.bsc.es/gitlab/rferrer/epi-builtins-ref/blob/master/README.md#documentation)
  • #pragma omp simd
  • Automatic parallelization
• Compiler
  • LLVM based
• Compiler Explorer (http://repo.hca.bsc.es/epic)
  • Interactive impact of source code modifications → generated code
  • A few example RISC-V vector codes using intrinsics
RISC-V VECTOR EMULATOR
• Emulation of the RISC-V vector ISA
  • Parametrized MAXVL
• Simple timing models
  • Memory architecture
  • Instruction timing
• Very detailed analytics in Paraver → co-design
  • Vector length
  • Register use
  • Memory addresses, cache ratios
  • Instruction timing
  • …
[Toolchain diagram: LLVM → emulation environment (emulation library, timing model) → trace2prv → .prv → Paraver; MUSA]

for (int kk = 0; kk < n; kk += bk) {
  vb0 = __builtin_epi_vload_f64(&b[kk  ][jj]);
  vb1 = __builtin_epi_vload_f64(&b[kk+1][jj]);
  vb2 = __builtin_epi_vload_f64(&b[kk+2][jj]);
  vb3 = __builtin_epi_vload_f64(&b[kk+3][jj]);
  {
    __epi_f64 vta0, vta1, vta2, vta3;
    __epi_f64 vtp0, vtp1, vtp2, vtp3;
    vta0 = __builtin_epi_vbroadcast_f64(a[ii][kk]);
    vtp0 = __builtin_epi_vfmul_f64(vta0, vb0);
    vc0  = __builtin_epi_vfadd_f64(vc0, vtp0);
    …
RISC-V VECTOR VISUALIZATION
• FE mockup
  • Program counter
  • Memory address
  • Memory instruction cost
Copyright © European Processor Initiative 2019. EPI SC19/Denver/Nov 2019
Final thoughts…
Age before beauty
• Behavior (insight/models) before syntax
• Detailed performance analytics before aggregated profiles
• Work instantiation and order before overhead
• Malleability before a fitted rigid structure
• Possibilities before how-tos
• Elegance before a one-day shine
• It is all about the programmer mindset!!!
Grandpa Cebolleta strikes again ("el abuelo Cebolleta ataca de nuevo")
Thanks!