Advanced Programming Model Constructs Using Tasking on the Latest NERSC (Knights Landing) Hardware

Jeremy Kemp (1), Alice Koniges (2), Yun (Helen) He (2), and Barbara Chapman (3)

(1) University of Houston, Houston, TX
(2) NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA
(3) Stony Brook University, Stony Brook, NY



Memory Layout

Proper use of the high bandwidth memory can have a major impact on performance. This can be done with the memkind library, with numactl, or by using the quadcache configuration of the KNL. Changing the way the matrix is allocated can also improve performance when operating on blocks of memory, especially at larger matrix sizes.
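
A minimal sketch of both allocation routes, using the standard memkind hbwmalloc interface (the matrix size and program shell are illustrative, not taken from the poster's experiments):

    /* Allocate the matrix in MCDRAM when it is available, falling back
       to DDR otherwise. Compile with -lmemkind. In quadflat mode the
       same effect is available from the command line, e.g.
       "numactl --membind=1 ./lu" (MCDRAM appears as NUMA node 1). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void)
    {
        size_t n = 8192;                 /* illustrative matrix size */
        int in_hbm = (hbw_check_available() == 0);
        double *A = in_hbm ? hbw_malloc(n * n * sizeof(double))
                           : malloc(n * n * sizeof(double));
        if (!A)
            return 1;
        printf("matrix allocated in %s\n", in_hbm ? "MCDRAM" : "DDR");
        /* ... operate on A ... */
        if (in_hbm) hbw_free(A); else free(A);
        return 0;
    }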

Blocking Optimizations

When tasks are mapped one-to-one to blocks, either the blocks grow to accommodate larger matrix sizes or the number of blocks (and tasks) increases. To avoid the overhead of too many tasks, and the loss of locality from oversized blocks, tasks must be mapped to multiple blocks and, if possible, to multiple iterations over the same blocks. Figure 6 shows the performance improvement of three optimizations for task dependencies: combining multiple blocks per task (block), combining multiple blocks and iterations per task (block-iter), and a recursive cache-oblivious version (block-rec) that operates on multiple blocks and iterations.
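
A minimal sketch of the block-combining idea, assuming a sentinel array dep with one entry per block and a hypothetical blocks-per-task constant; the function is assumed to run inside an omp parallel / omp single region:

    /* Each task owns a strip of BPT consecutive blocks; the depend
       clause uses an array section over the sentinels (OpenMP 4.x) so
       the runtime still sees exactly which blocks each task touches. */
    enum { NBLOCKS = 64, BPT = 4 };   /* BPT = blocks per task (tuned) */

    extern void update_blocks(double **blk, int first, int count);

    void process_blocks(double **blk, char dep[NBLOCKS])
    {
        for (int b = 0; b < NBLOCKS; b += BPT) {
            int len = (b + BPT <= NBLOCKS) ? BPT : NBLOCKS - b;
            #pragma omp task firstprivate(b, len) \
                             depend(inout: dep[b:len])
            update_blocks(blk, b, len);
        }
    }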

Fig 1. Performance of the different approaches on KNL.

Application Kernels

Performance Overview

The parallel for loop is the default approach to parallelism in OpenMP. These two charts provide an overview of tasking performance relative to this baseline, and of how much benefit the additional optimizations bring; the optimized versions easily surpass the performance of worksharing.
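
Schematically, the two baselines compared in the charts look like this (work_on is a placeholder for one element or block of work):

    extern void work_on(int i);

    void worksharing_version(int n)
    {
        /* parallel for: work divided statically, with an implicit
           barrier at the end of the loop */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            work_on(i);
    }

    void tasking_version(int n)
    {
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < n; ++i) {
                #pragma omp task firstprivate(i)
                work_on(i);
            }
            #pragma omp taskwait   /* replaces the barrier */
        }
    }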

Jacobi

Jacobi has very little cache reuse, as it writes a given element only once per iteration. As a result, dividing the matrix into blocks has no benefit, so the block-combining optimizations do not apply. Iteration combining is possible but more complex, due to overlapping reads and writes between neighbors. For the Jacobi iteration optimization, the second matrix is removed and replaced by 3 thread-local scratch rows and 4 synchronization rows per chunk of rows. Each task performs one iteration for 3 rows, writing the results into the scratch rows, and then writes the second iteration back to the original matrix.

Applying Optimizations

Locality is important, and difficult to achieve with OpenMP tasks; the consistently poor performance of tasks without task dependencies shows how much overhead is introduced and how much locality is given up. On the other hand, the unoptimized task dependencies demonstrate how much can be gained by removing unnecessary synchronization. The optimizations applied on top of the task dependencies then improve locality and achieve much better performance than the parallel for loops.

Conclusions

There are two major differences when programming for KNL rather than Haswell: KNL has very little cache per thread relative to Haswell, and using MCDRAM properly can drastically improve performance. The Jacobi results illustrate both of these points well: the fastest version on Haswell performs the worst on KNL due to the cache size, and the versions that move through memory sequentially perform much better. Future work includes applying the optimizations explored with LU to Jacobi, CoMD, and other applications.

Optimizing the LU Kernel

LU decomposition

The initial matrix is divided into a 2D matrix of blocks to improve cache usage and enable parallelization. There are 4 distinct operations divided into 3 phases. The worksharing (parallel for) version runs these 3 phases for each iteration, with a barrier between each phase. The tasking version has a similar structure, spawning tasks inside of loops and then synchronizing with taskwait instead of barriers. The task dependency version removes the synchronization between phases as well as between iterations, as illustrated by Diagram 1. LU has no communication or synchronization between blocks; the task dependencies simply control access to the matrix so that only one thread at a time writes a given block.
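
A sketch of how the task dependency version can express the phases (the block-kernel names are placeholders; A[i][j] points to block (i,j) and the pointers double as dependence sentinels; the loop nest is assumed to run inside an omp parallel / omp single region):

    enum { NB = 16 };   /* illustrative number of blocks per dimension */

    extern void diag_op(double *d);                        /* factor diagonal block     */
    extern void row_op(double *d, double *r);              /* update block in pivot row */
    extern void col_op(double *d, double *c);              /* update block in pivot col */
    extern void trail_op(double *c, double *r, double *t); /* update trailing block     */

    void lu_taskdep(double *A[NB][NB])
    {
        for (int k = 0; k < NB; ++k) {
            #pragma omp task depend(inout: A[k][k])
            diag_op(A[k][k]);

            for (int j = k + 1; j < NB; ++j) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[k][j])
                row_op(A[k][k], A[k][j]);
            }
            for (int i = k + 1; i < NB; ++i) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                col_op(A[k][k], A[i][k]);
            }
            for (int i = k + 1; i < NB; ++i)
                for (int j = k + 1; j < NB; ++j) {
                    #pragma omp task depend(in: A[i][k], A[k][j]) \
                                     depend(inout: A[i][j])
                    trail_op(A[i][k], A[k][j], A[i][j]);
                }
        }
        #pragma omp taskwait
    }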

Jacobi Solver

As Diagram 2 shows, each element of the matrix depends on each of its neighbors. As a result, overwriting an element would change the result of its neighbors, so a second matrix is typically written to and then swapped with the original matrix at the end of every iteration. Each version of Jacobi divides the 2D matrix into groups of whole rows instead of blocks. This reduces false sharing and enables very sequential access of memory. As with LU, the tasking version is similar to the parallel for version, and the task dependency version removes the synchronization between iterations.
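
A sketch of the task dependency pattern for Jacobi, with one task per chunk of rows and sentinel arrays standing in for the chunks, so the only ordering kept between iterations is between neighboring chunks (the names and chunking constant are hypothetical; assumed to run inside an omp parallel / omp single region):

    enum { NCHUNKS = 32 };

    /* 5-point stencil over the rows of chunk c, reading src, writing dst */
    extern void sweep_chunk(const double *src, double *dst, int c);

    void jacobi_taskdep(double *a, double *b, int iters)
    {
        char s0[NCHUNKS], s1[NCHUNKS];          /* dependence sentinels */
        char *prev = s0, *next = s1;
        double *src = a, *dst = b;

        for (int t = 0; t < iters; ++t) {
            for (int c = 0; c < NCHUNKS; ++c) {
                int lo = c > 0 ? c - 1 : c;
                int hi = c < NCHUNKS - 1 ? c + 1 : c;
                /* ready as soon as chunks c-1, c, c+1 of the previous
                   iteration finish: no barrier between iterations */
                #pragma omp task depend(in: prev[lo], prev[c], prev[hi]) \
                                 depend(out: next[c])
                sweep_chunk(src, dst, c);
            }
            { char *ts = prev; prev = next; next = ts; }  /* swap sentinels */
            { double *tm = src; src = dst; dst = tm; }    /* swap matrices  */
        }
        #pragma omp taskwait
    }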

CoMD

Data in CoMD consists of atoms in 3-dimensional space, where each atom has a position, velocity, energy, and force. The 3D space is divided into boxes, and each atom is placed into a box. Each iteration is a timestep in which each of these attributes is recalculated for every atom, and the atom is then moved to its corresponding box if it left its previous box. The majority of the compute time is spent calculating forces between particles; the tasking version has one task for calculating the forces between the atoms in a pair of boxes. The worksharing and task dependency versions both divide the work by boxes, with each box calculating the interactions with all of its neighbors.
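
A rough sketch of the per-box force tasks in the task dependency version; the box and neighbor helpers here are invented for illustration, not CoMD's actual data structures, and in the full application the sentinels would also link these tasks to the position updates of the previous timestep:

    enum { NBOXES = 512, NNEIGH = 27 };   /* 3x3x3 neighborhood incl. self */

    extern int  neighbor_of(int box, int k);   /* k-th neighbor box id */
    extern void box_force(int box, int nbr);   /* accumulate forces on box's atoms */

    void force_taskdep(char dep[NBOXES])
    {
        for (int b = 0; b < NBOXES; ++b) {
            /* one task per box: reads neighbor positions but writes
               only its own atoms' forces, so one sentinel suffices */
            #pragma omp task firstprivate(b) depend(inout: dep[b])
            for (int k = 0; k < NNEIGH; ++k)
                box_force(b, neighbor_of(b, k));
        }
    }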

[Diagram 2 panels: "Jacobi and using OpenMP Tasks to Parallelize Jacobi Code" and "OpenMP Tasks for a Single Iteration". Labels: neighboring cells, the current cell to be updated, buffer cells, and cells not affecting the current cell. Task1 ... Task4 each update a group of rows, reading the old matrix (input) and writing the new matrix (output); the two matrices are swapped after each iteration. Cell update rule: cell_next = 0.2 * (cell_now + Σ neighbors).]

Hardware

KNL (from the CARL NERSC KNL testbed) has 64 cores, each with 4 hardware threads. Cores are grouped into pairs called tiles, and each tile shares 1 MB of L2 cache. There is no L3 cache, but there are 16 GB of high bandwidth memory (MCDRAM) that can be configured as cache or allocated manually in different modes. The results on this poster use the quadflat and quadcache configurations. The quadcache configuration turns the MCDRAM into a cache that is no longer programmable, while the quadflat configuration enables the use of memkind or numactl to control more finely how memory is used.

For comparison, Cori Phase 1 nodes have 2 Haswell processors with a total of 32 cores and 2 hyperthreads per core. Each core has 256 KB of L2 cache, and each processor has 40 MB of L3 cache.

Fig 2. Performance of the different approaches on Haswell.

Fig 3. Improvement from allocating in high bandwidth memory via numactl.

Fig 4. Performance gained from allocating the matrix in columns of blocks in place of a single allocation.

Fig 6. Performance gained from combining multiple blocks into a task, and from combining multiple blocks and iterations per task.

Introduction

Most shared memory programming in HPC is done with highly synchronous constructs such as the parallel for loop in OpenMP. With the increasing core counts and non-uniformity of emerging hardware, a more asynchronous programming model is needed. The goal of this summer was to explore OpenMP tasks on the Knights Landing (KNL) hardware that will be used in Cori Phase II, to demonstrate the potential benefits of an asynchronous programming model. This is done with two kernels, LU decomposition and an iterative Jacobi kernel, as well as a proxy application, CoMD (1). For each of these applications, a tasking version without data dependencies was developed to show the performance cost of moving to tasks. A version with task dependencies then shows the improvements that can be gained from removing unnecessary synchronization, and finally several optimizations on the task dependency version show how much additional performance can be gained.

(1) Unoptimized tasking version provided by Riyaz Haque (UCLA) and Bronis de Supinski (LLNL).

Diagram 1. Dependencies between blocks in the LU kernel.

Diagram 2. Dependencies between blocks in the Jacobi kernel.

Diagram 3. Division of space into boxes and the atoms involved in a single calculation.

CoMD

With CoMD, the initial conversion of the force function to task dependencies improved performance. The all-task-dep conversion of the application replaced all parallelism and synchronization in the application with task dependencies, including serial regions with data dependencies. Combining blocks would have been a better first optimization, as it would only have reduced overhead, whereas the full conversion introduced too much overhead and hurt performance.

For Jacobi, whole rows are too large to fit in the L2 of the KNL, so cache reuse does not improve, whereas the Haswell performance more than doubles because the rows fit into the very large L3 cache. Further implementation work is needed on an iteration-combining version that operates on blocks that fit inside the smaller caches.

Fig 5. How much the size of the data (possibly multiple blocks) that each task operates on varies across implementations and matrix sizes.

Fig 7. Performance of the different versions of Jacobi.

Fig 8. Performance of the different versions of CoMD.

References

1. http://colfaxresearch.com/knl-numa/
2. Heller, Thomas, Hartmut Kaiser, and Klaus Iglberger. "Application of the ParalleX execution model to stencil-based problems." Computer Science - Research and Development 28.2-3 (2013): 253-261.
3. http://www.exmatex.org/comd.html
4. http://www.nersc.gov/users/computational-systems/cori/
5. https://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/knl-white-boxes/