Advanced Programming Model Constructs Using Tasking on the Latest NERSC (Knights Landing) Hardware

Jeremy Kemp (1), Alice Koniges (2), Yun (Helen) He (2), and Barbara Chapman (3)

(1) University of Houston, Houston, TX
(2) NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA
(3) Stony Brook University, Stony Brook, NY



Memory Layout

Proper use of the high bandwidth memory can have a major impact on performance. This can be done with the memkind library, with numactl, or by using the quadcache configuration of the KNL. Changing the way the matrix is allocated can also improve performance when operating on blocks of memory, especially at larger matrix sizes.
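
A minimal sketch of both allocation routes, using the standard memkind hbwmalloc interface (the matrix size and program shell are illustrative, not taken from the poster's experiments):

    /* Allocate the matrix in MCDRAM when it is available, falling back
       to DDR otherwise. Compile with -lmemkind. In quadflat mode the
       same effect is available from the command line, e.g.
       "numactl --membind=1 ./lu" (MCDRAM appears as NUMA node 1). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void)
    {
        size_t n = 8192;                 /* illustrative matrix size */
        int in_hbm = (hbw_check_available() == 0);
        double *A = in_hbm ? hbw_malloc(n * n * sizeof(double))
                           : malloc(n * n * sizeof(double));
        if (!A)
            return 1;
        printf("matrix allocated in %s\n", in_hbm ? "MCDRAM" : "DDR");
        /* ... operate on A ... */
        if (in_hbm) hbw_free(A); else free(A);
        return 0;
    }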

Blocking Optimizations

When tasks are mapped one-to-one to blocks, either the blocks grow to accommodate larger matrix sizes or the number of blocks (and tasks) increases. To avoid the overhead of too many tasks, and the loss of locality from oversized blocks, tasks must be mapped to multiple blocks and, if possible, to multiple iterations over the same blocks. Figure 6 shows the performance improvement of three optimizations for task dependencies: combining multiple blocks per task (block), combining multiple blocks and iterations per task (block-iter), and a recursive cache-oblivious version (block-rec) that operates on multiple blocks and iterations.
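
A minimal sketch of the block-combining idea, assuming a sentinel array dep with one entry per block and a hypothetical blocks-per-task constant; the function is assumed to run inside an omp parallel / omp single region:

    /* Each task owns a strip of BPT consecutive blocks; the depend
       clause uses an array section over the sentinels (OpenMP 4.x) so
       the runtime still sees exactly which blocks each task touches. */
    enum { NBLOCKS = 64, BPT = 4 };   /* BPT = blocks per task (tuned) */

    extern void update_blocks(double **blk, int first, int count);

    void process_blocks(double **blk, char dep[NBLOCKS])
    {
        for (int b = 0; b < NBLOCKS; b += BPT) {
            int len = (b + BPT <= NBLOCKS) ? BPT : NBLOCKS - b;
            #pragma omp task firstprivate(b, len) \
                             depend(inout: dep[b:len])
            update_blocks(blk, b, len);
        }
    }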

Fig 1. Performance of the different approaches on KNL.

Application Kernels

Performance Overview

The parallel for loop is the default approach to parallelism in OpenMP. These two charts provide an overview of tasking performance relative to this baseline, and of how much benefit the additional optimizations bring; the optimized versions easily surpass the performance of worksharing.
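
Schematically, the two baselines compared in the charts look like this (work_on is a placeholder for one element or block of work):

    extern void work_on(int i);

    void worksharing_version(int n)
    {
        /* parallel for: work divided statically, with an implicit
           barrier at the end of the loop */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            work_on(i);
    }

    void tasking_version(int n)
    {
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < n; ++i) {
                #pragma omp task firstprivate(i)
                work_on(i);
            }
            #pragma omp taskwait   /* replaces the barrier */
        }
    }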

Jacobi

Jacobi has very little cache reuse, as it writes a given element only once per iteration. As a result, dividing the matrix into blocks has no benefit, so the block-combining optimizations do not apply. Iteration combining is possible but more complex, due to overlapping reads and writes between neighbors. For the Jacobi iteration optimization, the second matrix is removed and replaced by 3 thread-local scratch rows and 4 synchronization rows per chunk of rows. Each task performs one iteration for 3 rows, writing the results into the scratch rows, and then writes the second iteration back to the original matrix.

Applying Optimizations

Locality is important, and difficult to achieve with OpenMP tasks; the consistently poor performance of tasks without task dependencies shows how much overhead is introduced and how much locality is given up. On the other hand, the unoptimized task dependencies demonstrate how much can be gained by removing unnecessary synchronization. The optimizations applied on top of the task dependencies then improve locality and achieve much better performance than the parallel for loops.

Conclusions

There are two major differences when programming for KNL rather than Haswell: KNL has very little cache per thread relative to Haswell, and using MCDRAM properly can drastically improve performance. The Jacobi results illustrate both of these points well: the fastest version on Haswell performs the worst on KNL due to the cache size, and the versions that move through memory sequentially perform much better. Future work includes applying the optimizations explored with LU to Jacobi, CoMD, and other applications.

Optimizing the LU Kernel

LU decomposition

The initial matrix is divided into a 2D matrix of blocks to improve cache usage and enable parallelization. There are 4 distinct operations divided into 3 phases. The worksharing (parallel for) version runs these 3 phases for each iteration, with a barrier between each phase. The tasking version has a similar structure, spawning tasks inside of loops and then synchronizing with taskwait instead of barriers. The task dependency version removes the synchronization between phases as well as between iterations, as illustrated by Diagram 1. LU has no communication or synchronization between blocks; the task dependencies simply control access to the matrix so that only one thread at a time writes a given block.
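
A sketch of how the task dependency version can express the phases (the block-kernel names are placeholders; A[i][j] points to block (i,j) and the pointers double as dependence sentinels; the loop nest is assumed to run inside an omp parallel / omp single region):

    enum { NB = 16 };   /* illustrative number of blocks per dimension */

    extern void diag_op(double *d);                        /* factor diagonal block     */
    extern void row_op(double *d, double *r);              /* update block in pivot row */
    extern void col_op(double *d, double *c);              /* update block in pivot col */
    extern void trail_op(double *c, double *r, double *t); /* update trailing block     */

    void lu_taskdep(double *A[NB][NB])
    {
        for (int k = 0; k < NB; ++k) {
            #pragma omp task depend(inout: A[k][k])
            diag_op(A[k][k]);

            for (int j = k + 1; j < NB; ++j) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[k][j])
                row_op(A[k][k], A[k][j]);
            }
            for (int i = k + 1; i < NB; ++i) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                col_op(A[k][k], A[i][k]);
            }
            for (int i = k + 1; i < NB; ++i)
                for (int j = k + 1; j < NB; ++j) {
                    #pragma omp task depend(in: A[i][k], A[k][j]) \
                                     depend(inout: A[i][j])
                    trail_op(A[i][k], A[k][j], A[i][j]);
                }
        }
        #pragma omp taskwait
    }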

Jacobi Solver

As Diagram 2 shows, each element of the matrix depends on each of its neighbors. As a result, overwriting an element would change the result of its neighbors, so a second matrix is typically written to and then swapped with the original matrix at the end of every iteration. Each version of Jacobi divides the 2D matrix into groups of whole rows instead of blocks. This reduces false sharing and enables very sequential access of memory. As with LU, the tasking version is similar to the parallel for version, and the task dependency version removes the synchronization between iterations.
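
A sketch of the task dependency pattern for Jacobi, with one task per chunk of rows and sentinel arrays standing in for the chunks, so the only ordering kept between iterations is between neighboring chunks (the names and chunking constant are hypothetical; assumed to run inside an omp parallel / omp single region):

    enum { NCHUNKS = 32 };

    /* 5-point stencil over the rows of chunk c, reading src, writing dst */
    extern void sweep_chunk(const double *src, double *dst, int c);

    void jacobi_taskdep(double *a, double *b, int iters)
    {
        char s0[NCHUNKS], s1[NCHUNKS];          /* dependence sentinels */
        char *prev = s0, *next = s1;
        double *src = a, *dst = b;

        for (int t = 0; t < iters; ++t) {
            for (int c = 0; c < NCHUNKS; ++c) {
                int lo = c > 0 ? c - 1 : c;
                int hi = c < NCHUNKS - 1 ? c + 1 : c;
                /* ready as soon as chunks c-1, c, c+1 of the previous
                   iteration finish: no barrier between iterations */
                #pragma omp task depend(in: prev[lo], prev[c], prev[hi]) \
                                 depend(out: next[c])
                sweep_chunk(src, dst, c);
            }
            { char *ts = prev; prev = next; next = ts; }  /* swap sentinels */
            { double *tm = src; src = dst; dst = tm; }    /* swap matrices  */
        }
        #pragma omp taskwait
    }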

CoMD

Data in CoMD consists of atoms in 3-dimensional space, where each atom has a position, velocity, energy, and force. The 3D space is divided into boxes, and each atom is placed into a box. Each iteration is a timestep in which each of these attributes is recalculated for every atom, and the atom is then moved to its corresponding box if it left its previous box. The majority of the compute time is spent calculating forces between particles; the tasking version has one task for calculating the forces between the atoms in a pair of boxes. The worksharing and task dependency versions both divide the work by boxes, with each box calculating the interactions with all of its neighbors.
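
A rough sketch of the per-box force tasks in the task dependency version; the box and neighbor helpers here are invented for illustration, not CoMD's actual data structures, and in the full application the sentinels would also link these tasks to the position updates of the previous timestep:

    enum { NBOXES = 512, NNEIGH = 27 };   /* 3x3x3 neighborhood incl. self */

    extern int  neighbor_of(int box, int k);   /* k-th neighbor box id */
    extern void box_force(int box, int nbr);   /* accumulate forces on box's atoms */

    void force_taskdep(char dep[NBOXES])
    {
        for (int b = 0; b < NBOXES; ++b) {
            /* one task per box: reads neighbor positions but writes
               only its own atoms' forces, so one sentinel suffices */
            #pragma omp task firstprivate(b) depend(inout: dep[b])
            for (int k = 0; k < NNEIGH; ++k)
                box_force(b, neighbor_of(b, k));
        }
    }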

[Diagram 2 panels: "Jacobi and using OpenMP Tasks to Parallelize Jacobi Code" and "OpenMP Tasks for a Single Iteration". Labels: neighboring cells, the current cell to be updated, buffer cells, and cells not affecting the current cell. Task1 ... Task4 each update a group of rows, reading the old matrix (input) and writing the new matrix (output); the two matrices are swapped after each iteration. Cell update rule: cell_next = 0.2 * (cell_now + Σ neighbors).]

Hardware

KNL (from the CARL NERSC KNL testbed) has 64 cores, each with 4 hardware threads. Cores are grouped into pairs called tiles, and each tile shares 1 MB of L2 cache. There is no L3 cache, but there are 16 GB of high bandwidth memory (MCDRAM) that can be configured as cache or allocated manually in different modes. The results on this poster use the quadflat and quadcache configurations. The quadcache configuration turns the MCDRAM into a cache that is no longer programmable, while the quadflat configuration enables the use of memkind or numactl to control more finely how memory is used.

For comparison, Cori Phase 1 nodes have 2 Haswell processors with a total of 32 cores and 2 hyperthreads per core. Each core has 256 KB of L2 cache, and each processor has 40 MB of L3 cache.

Fig 2. Performance of the different approaches on Haswell.

Fig 3. Improvement from allocating in high bandwidth memory via numactl.

Fig 4. Performance gained from allocating the matrix in columns of blocks in place of a single allocation.

Fig 6. Performance gained from combining multiple blocks into a task, and from combining multiple blocks and iterations per task.

Introduction

Most shared memory programming in HPC is done with highly synchronous constructs such as the parallel for loop in OpenMP. With the increasing core counts and non-uniformity of emerging hardware, a more asynchronous programming model is needed. The goal of this summer was to explore OpenMP tasks on the Knights Landing (KNL) hardware that will be used in Cori Phase II, to demonstrate the potential benefits of an asynchronous programming model. This is done with two kernels, LU decomposition and an iterative Jacobi kernel, as well as a proxy application, CoMD (1). For each of these applications, a tasking version without data dependencies was developed to show the performance cost of moving to tasks. A version with task dependencies then shows the improvements that can be gained from removing unnecessary synchronization, and finally several optimizations on the task dependency version show how much additional performance can be gained.

(1) Unoptimized tasking version provided by Riyaz Haque (UCLA) and Bronis de Supinski (LLNL).

Diagram 1. Dependencies between blocks in the LU kernel.

Diagram 2. Dependencies between blocks in the Jacobi kernel.

Diagram 3. Division of space into boxes and the atoms involved in a single calculation.

CoMD

With CoMD, the initial conversion of the force function to task dependencies improved performance. The all-task-dep conversion of the application replaced all parallelism and synchronization in the application with task dependencies, including serial regions with data dependencies. Combining blocks would have been a better first optimization, as it would only have reduced overhead, whereas the full conversion introduced too much overhead and hurt performance.

For Jacobi, whole rows are too large to fit in the L2 of the KNL, so cache reuse does not improve, whereas the Haswell performance more than doubles because the rows fit into the very large L3 cache. Further implementation work is needed on an iteration-combining version that operates on blocks that fit inside the smaller caches.

Fig 5. How much the size of the data (possibly multiple blocks) that each task operates on varies across implementations and matrix sizes.

Fig 7. Performance of the different versions of Jacobi.

Fig 8. Performance of the different versions of CoMD.

References

1. http://colfaxresearch.com/knl-numa/
2. Heller, Thomas, Hartmut Kaiser, and Klaus Iglberger. "Application of the ParalleX execution model to stencil-based problems." Computer Science - Research and Development 28.2-3 (2013): 253-261.
3. http://www.exmatex.org/comd.html
4. http://www.nersc.gov/users/computational-systems/cori/
5. https://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/knl-white-boxes/