advanced programming model constructs using ... - nersc › assets › uploads ›...
TRANSCRIPT
PosterPrintSize:Thispostertemplateis48”highby36”wide.Itcanbeusedtoprintanyposterwitha4:3aspectraAo.
Placeholders:ThevariouselementsincludedinthisposterareonesweoCenseeinmedical,research,andscienAficposters.Feelfreetoedit,move,add,anddeleteitems,orchangethelayouttosuityourneeds.Alwayscheckwithyourconferenceorganizerforspecificrequirements.
ImageQuality:YoucanplacedigitalphotosorlogoartinyourposterfilebyselecAngtheInsert,Picturecommand,orbyusingstandardcopy&paste.Forbestresults,allgraphicelementsshouldbeatleast150-200pixelsperinchintheirfinalprintedsize.Forinstance,a1600x1200pixelphotowillusuallylookfineupto8“-10”wideonyourprintedposter.Topreviewtheprintqualityofimages,selectamagnificaAonof100%whenpreviewingyourposter.Thiswillgiveyouagoodideaofwhatitwilllooklikeinprint.Ifyouarelayingoutalargeposterandusinghalf-scaledimensions,besuretopreviewyourgraphicsat200%toseethemattheirfinalprintedsize.Pleasenotethatgraphicsfromwebsites(suchasthelogoonyourhospital'soruniversity'shomepage)willonlybe72dpiandnotsuitableforprinAng.
[Thissidebarareadoesnotprint.]
ChangeColorTheme:Thistemplateisdesignedtousethebuilt-incolorthemesinthenewerversionsofPowerPoint.Tochangethecolortheme,selecttheDesigntab,thenselecttheColorsdrop-downlist.Thedefaultcolorthemeforthistemplateis“Office”,soyoucanalwaysreturntothataCertryingsomeofthealternaAves.
PrinAngYourPoster:Onceyourposterfileisready,visitwww.genigraphics.comtoorderahigh-quality,affordableposterprint.EveryorderreceivesafreedesignreviewandwecandeliverasfastasnextbusinessdaywithintheUSandCanada.Genigraphics®hasbeenproducingoutputfromPowerPoint®longerthananyoneintheindustry;daAngbacktowhenwehelpedMicrosoC®designthePowerPointsoCware.USandCanada:1-800-790-4001Email:[email protected]
[Thissidebarareadoesnotprint.]
MemoryLayoutProperuseofthehighbandwidthmemorycanhaveamajorimpactonperformance.Thiscaneitherdonewiththememkindlibrary,ornumactl,orbyusingthequadcacheconfiguraAonoftheKNL.ChangingthewaythematrixisallocatedcanalsoimproveperformancewhenoperaAngonblocksofmemory,especiallyonlargermatrixsizes.
BlockingOpGmizaGonsWhenthenumberoftasksaremappedtothenumberofblockseitherthesizeoftheblockgetslargertoaccommodatelargermatrixsizes,orthenumberofblocksincreases.Inordertoavoidtheoverheadoftoomanytasks,orthelossoflocalityfromoversizedblocks,tasksmustbemappedtomulApleblocks,andifpossiblemulApleiteraAonsoverthesameblocks.Figure6showstheperformanceimprovementof3opAmizaAonsfortaskdependencies;combiningmulApleblockspertasks(block),combiningmulApleblocksanditeraAonspertask(block-iter),andarecursivecacheobliviousversion(block-rec)thatoperatesonmulApleblocksanditeraAons.
Fig1.ShowstheperformanceofthedifferentapproachesonKNL
AdvancedProgrammingModelConstructsUsingTaskingontheLatestNERSC(KnightsLanding)Hardware
JeremyKemp1,AliceKoniges2,Yun(Helen)He2,andBarbaraChapman3UniversityofHouston,Houston,TX1
NERSC,LawrenceBerkeleyNaAonalLaboratory,Berkeley,CA2
StonyBrookUniversity,StonyBrook,NY3
1. hqp://colfaxresearch.com/knl-numa/2. Heller,Thomas,HartmutKaiser,andKlausIglberger."ApplicaAonoftheParalleXexecuAonmodeltostencil-basedproblems."ComputerScience-ResearchandDevelopment28.2-3(2013):253-261.3. hqp://www.exmatex.org/comd.html4. hqp://www.nersc.gov/users/computaAonal-systems/cori/5. hqps://www.nersc.gov/users/computaAonal-systems/cori/applicaAon-porAng-and-performance/knl-white-boxes/
References
ApplicaGonKernels
PerformanceOverviewTheparallelforloopisthedefaultapproachtoparallelisminOpenMP.ThesetwochartsprovideanoverviewoftaskingperformancerelaAvetothis,andhowmuchbenefitthereisfromaddiAonalopAmizaAons,whicheasilysurpasstheperformanceofworksharing.
JacobiJacobihasveryliqlecachereuse,asitonlywritesagivenelementonceperiteraAon.Asaresult,dividingupthematrixintoblockshasnobenefit,sotheblockcombinaAonopAmizaAonsdon’tapply.IteraAoncombiningispossible,butmorecomplex,duetooverlappingread/writesbetweenneighbors.FortheJacobiiteraAonopAmizaAon,thesecondmatrixisremoved,andreplacedby3threadlocalscratchrowsand4synchronizaAonrowsperchunkofrows.EachtaskperformsaniteraAonfor3rowswriAngtheresultsintothescratchrows,andthenwriAngtheseconditeraAonbacktotheoriginalmatrix.
ApplyingOpGmizaGons
Localityisimportant,anddifficulttodowithOpenMPtasks;theconsistentpoorperformanceoftasks(withouttaskdependencies)showshowmuchoverheadisintroducedandhowmuchlocalityisgivenup.Ontheotherhand,unopAmizedtaskdependenciesdemonstratehowmuchcanbegainedbyremovingunnecessarysynchronizaAon.TheopAmizaAonsonthetaskdependenciescanthenimprovethelocalityandachievemuchbeqerperformancethantheparallelforloops.TherearetwomajordifferenceswhenprogrammingforKNLoverHaswell;KNLhasveryliqlecacheperthreadrelaAvetoHaswell,andusingMCDRAMproperlycandrasAcallyimproveperformance.TheJacobiresultsillustratebothofthesepointsverywell.ThefastestversiononHaswellperformstheworstonKNLduetothecachesize,andtheversionsthatmovethroughmemorysequenAallyperformmuchbeqer.FutureworkincludesfurtherapplicaAonsoftheopAmizaAonsexploredwithLUtoJacobi,CoMD,andotherapplicaAons.
Conclusions
OpGmizingtheLUKernel
LUdecomposiGonTheiniAalmatrixisdividedintoa2Dmatrixofblockstoimprovecacheusage,andenableparallelizaAon.Thereare4disAnctoperaAonsdividedinto3phases.Theworksharing(parallelfor)versiondividestheseinto3phasesforeachiteraAon,withabarrierbetweeneachphase.Thetaskingversionhasasimilarstructuretotheparallelforversion,spawningtasksinsideofloopsandthensynchronizingontaskwaitinsteadofbarriers.ThetaskdependencyversionremovesthesynchronizaAonbetweenphasesaswellasbetweeniteraAons,asIllustratedbydiagram1.LUhasnocommunicaAonorsynchronizaAonbetweenblocks.ThetaskdependenciessimplycontrolaccesstothematrixsoonlyonethreadataAmeiswriAngtoit.
Hardware
JacobiSolverAsdiagram2shows,eachelementofthematrixdependsoneachofitsneighbors.Asaresult,overwriAnganelementwillchangetheresultofitsneighbor,soasecondmatrixistypicallywriqento,andthenswappedwiththeoriginalmatrixattheendofeveryiteraAon.EachversionofJacobidividesupthe2Dmatrixintogroupsofwholerowsinsteadofblocks.ThisreducesfalsesharingandenablesverysequenAalaccessofmemory.SimilartoLU,thetaskingversionissimilartotheparallelforversion,andthetaskdependencyversionremovesthesynchronizaAonbetweeniteraAons.
CoMDDatainCoMDconsistsofatomsin3dimensionalspace,whereeachatomhasaposiAon,velocity,energy,andforce.The3Dspaceisdividedintoboxes,andeachatomisplacedintoabox.EachiteraAonisaAmestepwhereeachoftheaqributesisrecalculatedforeachatom,andthenatomismovedtoitscorrespondingbox,ifitleCitspreviousbox.ThemajorityofthecomputeAmeisspentcalculaAngforcebetweenparAcles,wherethetaskingversionhasonetaskforcalculaAngforcesbetweenatomsinapairofboxes.Theworksharingandtaskdependencyversionsbothdivideuptheworkbyboxes,calculaAngtheinteracAonswithallofitsneighbors.
Cellsnotaffectingcurrentcell
NeighboringCells Currentcelltobeupdated BufferCells
JacobiandusingOpenMPTaskstoParallelizeJacobiCode
OpenMPTasksforaSingleIteration
Task1...Task4
...
OldMatrix NewMatrix
SwapOldandNewMatrixaftereachIteration
...
Cellnext=0.2*(Cellnow+ΣNeighbours)
InputOutput
KNL(FromtheCARLNERSCKNLtestbed)has64cores,eachwith4hardwarethreads.TwocoresaregroupedintopairsasaAleandshare1MBofL2cache.ThereisnoL3cache,butthereis16GBofhighbandwidthmemorythatcanbeconfiguredascacheorallocatedmanuallyindifferentmodes.TheresultsonthisposterusethequadflatandquadcacheconfiguraAons.ThecacheconfiguraAonturnstheMCDRAMintocachethatisnolongerprogrammable,whilethequadflatconfiguraAonenablestheuseofmemkindornumactltomorefinelycontrolhowmemoryisused.
Fig2.ShowstheperformanceofthedifferentapproachesonHaswell
Fig3.ShowstheimprovementfromallocaAnginhighbandwidthmemoryvia
numactl.
Fig4.ShowstheperformancegainedfromallocaAngthematrixincolumnsofblocksinplaceofasingleallocaAon.
Fig6.ShowstheperformancegainedfromcombiningmulApleblocksintoataskandtheperformancegainedby
combiningmulApleblocksanditeraAonspertask.
IntroducGonMostsharedmemoryprogramminginHPCisdonewithhighlysynchronousconstructssuchasthe“parallelfor”inOpenMP.Withtheincreasingcorecountsandnon-uniformityinemerginghardware,amoreasynchronousprogrammingmodelisneeded.ThegoalofthissummerwastoexploreOpenMPtasksontheKnightsLanding(KNL)hardwarethatwillbeusedinCoriPhaseIItodemonstratethepotenAalbenefitsofanasynchronousprogrammingmodel.Thisisdonewithtwokernels,LUdecomposiAonandaniteraAveJacobKernel,aswellasaproxyapplicaAon,CoMD1.ForeachoftheseapplicaAons,ataskingversionwithoutdatadependencieswasdevelopedtoshowtheperformancecostofmovingtotasks.ThenaversionwithtaskdependenciesshowstheimprovementsthatcanbegainedfromremovingunnecessarysynchronizaAon,andfinallyseveralopAmizaAonsonthetaskdependencyversionshowhowmuchaddiAonalperformancecanbegained.1unopAmizedtaskingversionprovidedbyRiyazHaquw(UCLA)andBronisdeSupinski(LLNL)
Diagram1.ShowsthedependenciesbetweenblocksintheLUKernel.
Diagram2.ShowsthedependenciesbetweenblocksintheJacobiKernel.
Diagram3.Showsthedivisionofspaceintoboxesandtheatomsinvolvedinasingle
calculaAon
CoMDWithCoMD,theiniAalconversionoftheforcefuncAontotaskdependenciesimprovedperformance.Theall-task-depconversionoftheapplicaAonreplacedallparallelismandsynchronizaAonintheapplicaAonwithtaskdependencies,includingserialregionswithdatadependencies.CombiningblockswouldhavebeenabeqerfirstopAmizaAon,asitwouldhaveonlyreducedoverheadwherethefullconversionintroducedtoomuchoverheadandhurtperformance.
WholerowsaretoolargetofitintheL2oftheKNL,sothecachereusedoesnotimprove.WhereastheHaswellperformancemorethandoublesduetototherowsfixngintotheverylargeL3cache.FurtherimplementaAonworkisneededonaniteraAoncombiningversionthatoperatesonblocksthatfitinsideofsmallerCaches.
Forcomparison,CoriPhase1nodeshave2Haswellprocessorswithatotalof32cores,2hyperthreadspercore.Eachcorehas256KBofL2cache,andeachprocessorhas40MBofL3cache.
Fig5.Showshowmuchthesizeofdata(possiblymulApleblocks)thateachtask
operatesonvariesfordifferentimplementaAonsandmatrixsizes.
Fig7.ShowstheperformanceofthedifferentversionsofJacobi
Fig8.ShowstheperformanceofthedifferentversionsofCoMD