For most users and applications, using default settings works very well.
For users who want to experiment to get the best performance they can, the following presentation gives you some information on compilers and settings to try. While it doesn't cover absolutely everything, the presentation tries to address some of the tunable parameters which we have found to provide increased performance in the situations discussed.
If no module (e.g. xtpe-mc12) is loaded, and no 'arch' is specified in the compiler options, the compilers default to the node type on which the compiler is running, which may not be the same as the compute nodes!
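A minimal sketch of targeting the compute nodes explicitly (assuming Magny-Cours compute nodes, for which these slides use the xtpe-mc12 module; the source file and the ftn wrapper usage are illustrative):

  module load xtpe-mc12   # sets the compute-node target for whichever compiler environment is loaded
  ftn -c kernel.f90       # hypothetical source file; the wrapper now generates compute-node code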
The best compiler is not the same for every application.
…from Cray's Perspective
PGI – Very good Fortran, okay C and C++
• Good vectorization
• Good functional correctness with optimization enabled
• Good manual and automatic prefetch capabilities
• Very interested in the Linux HPC market, although that is not their only focus
• Excellent working relationship with Cray, good bug responsiveness
Pathscale – Good Fortran, C, probably good C++
• Outstanding scalar optimization for loops that do not vectorize
• Fortran front end uses an older version of the CCE Fortran front end
• OpenMP uses a non-pthreads approach
• Scalar benefits will not get as much mileage with longer vectors
Intel (not NERSC supported) – Good Fortran, excellent C and C++ (if you ignore vectorization)
• Automatic vectorization capabilities are modest, compared to PGI and CCE
• Use of inline assembly is encouraged
• Focus is more on best speed for scalar, non-scaling apps
• Tuned for Intel architectures, but actually works well for some applications on AMD
GNU – So-so Fortran, outstanding C and C++ (if you ignore vectorization)
• Obviously, the best for gcc compatibility
• Scalar optimizer was recently rewritten and is very good
• Vectorization capabilities focus mostly on inline assembly
• Note the last three releases (4.3, 4.4, and 4.5) have been incompatible with each other and required recompilation of Fortran modules
CCE – Outstanding Fortran, very good C, and okay C++
• Very good vectorization
• Very good Fortran language support; only real choice for Coarrays
• C support is quite good, with UPC support
• Very good scalar optimization and automatic parallelization
• Clean implementation of OpenMP 3.0, with tasks
• Sole delivery focus is on Linux-based Cray hardware systems
• Best bug turnaround time (if it isn't, let us know!)
• Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)
• No inline assembly support
CCE
• Use default optimization levels: it's the equivalent of most other compilers' -O3 or -fast.
• Use -O3,fp3 (or -O3 -hfp3, or some variation).
  • -O3 only gives you slightly more than -O2.
  • -hfp3 gives you a lot more floating point optimization, especially 32-bit.
• If an application is intolerant of floating point reassociation, try a lower -hfp number; try -hfp1 first, and only -hfp0 if absolutely necessary.
  • Might be needed for tests that require strict IEEE conformance,
  • or for applications that have 'validated' results from a different compiler.
• We do not suggest using -Oipa5, -Oaggress, and so on; higher numbers are not always correlated with better performance.
• Compiler feedback: -rm (Fortran), -hlist=m (C)
• If you know you don't want OpenMP: -xomp or -Othread0
• man crayftn; man craycc; man crayCC
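A minimal sketch of the suggested CCE starting point, using the flags listed above (file names are hypothetical):

  ftn -O3 -hfp3 -rm -o app app.f90     # aggressive floating-point optimization plus a loopmark listing
  cc  -O3 -hfp3 -hlist=m -c solver.c   # the same experiment for C, with listing feedback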
PGI
• -fast -Mipa=fast(,safe)
• If you can be flexible with precision, also try -Mfprelaxed.
• Compiler feedback: -Minfo=all -Mneginfo
• man pgf90; man pgcc; man pgCC; or pgf90 -help
Pathscale
• -Ofast (note: this is a little looser with precision than other compilers)
• Compiler feedback: -LNO:simd_verbose=ON
• man eko ("Every Known Optimization")
GNU
• -O3 -ffast-math -funroll-loops
• Compiler feedback: -ftree-vectorizer-verbose=2
• man gfortran; man gcc; man g++
Intel
• -fast
• Compiler feedback:
• man ifort; man icc; man iCC
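On a Cray system each of these compilers sits behind a PrgEnv module and the ftn/cc/CC wrappers; a minimal sketch of trying the PGI settings above (module names as on a typical XT/XE install; source file hypothetical):

  module swap PrgEnv-cray PrgEnv-pgi
  ftn -fast -Mipa=fast,safe -Minfo=all -Mneginfo -o app app.f90   # add -Mfprelaxed only if precision is flexible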
Use the xtpe-mc12 module and it is all automatic.
The OpenMP-threaded BLAS/LAPACK library is the default if the xtpe-mc12 module is loaded. The serial version is used if OMP_NUM_THREADS is not set or is set to 1.
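A minimal sketch of getting the threaded library (the thread count is illustrative):

  module load xtpe-mc12      # threaded BLAS/LAPACK becomes the default
  export OMP_NUM_THREADS=6   # unset, or set to 1, and the serial version is used instead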
Experiment with the Environment Variable Grab Bag
• Only relevant for mixed MPI/SHMEM/UPC/CAF codes.
• Normally you want to leave this enabled so MPICH2 and DMAPP can share the same memory registration cache.
• May have to disable for codes that call shmem_init after MPI_Init.
• May have to set to disabled if one gets a traceback like this:
Rank 834 Fatal error in MPI_Alltoall: Other MPI error, error stack:
MPI_Alltoall(768).......................: MPI_Alltoall(sbuf=0x2aab9c301010, scount=2596, MPI_DOUBLE, rbuf=0x2aab7ae01010, rcount=2596, MPI_DOUBLE, comm=0x84000004) failed
MPIR_Alltoall(469)......................:
MPIC_Isend(453).........................:
MPID_nem_lmt_RndvSend(102)..............:
MPID_nem_gni_lmt_initiate_lmt(580)......: failure occurred while attempting to send RTS packet
MPID_nem_gni_iStartContigMsg(869).......:
MPID_nem_gni_iSendContig_start(763).....:
MPID_nem_gni_send_conn_req(626).........:
MPID_nem_gni_progress_send_conn_req(193):
MPID_nem_gni_smsg_mbox_alloc(357).......:
MPID_nem_gni_smsg_mbox_block_alloc(268).: GNI_MemRegister GNI_RC_ERROR_RESOURCE
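The transcript drops the variable name from the slide title; on Cray MPT this MPICH2/DMAPP interoperability switch is MPICH_GNI_DMAPP_INTEROP (an assumption here, as is the 'disabled' value). A minimal sketch for a code that calls shmem_init after MPI_Init:

  export MPICH_GNI_DMAPP_INTEROP=disabled   # assumed variable name and value; default is enabled
  aprun -n 1024 ./mixed_mpi_shmem_app       # hypothetical executable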
• Default is 8192 bytes.
• Maximum size message that can go through the eager protocol.
• May help for apps that are sending medium-sized messages, and do better when loosely coupled. Does the application have a large amount of time in MPI_Waitall? Setting this environment variable higher may help.
• Max value is 131072 bytes.
• Remember that for this path it helps to pre-post receives if possible.
• Note that a 40-byte CH3 header is included when accounting for the message size.
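On Cray MPT the eager cutoff is MPICH_GNI_MAX_EAGER_MSG_SIZE (assumed; the transcript omits the slide title). A minimal sketch of raising it to its maximum:

  export MPICH_GNI_MAX_EAGER_MSG_SIZE=131072   # default 8192; the 40-byte CH3 header counts toward this size
  aprun -n 512 ./app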
• Default is enabled.
• Controls whether or not to use a lazy memory deregistration policy inside UDREG.
• May help for apps that transfer a lot of messages via the LMT path and use 4 KB pages.
• Disabling results in a significant drop (~40-50%) in measured bandwidth for large transfers.
• You may be better off trying to use large pages than setting this to 'disabled'.
• Default is 64 32K buffers (2M total).
• Controls the number of 32K DMA buffers available for each rank to use in the eager protocol described earlier.
• May help to modestly increase it, but other resources constrain the usability of a large number of buffers.
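On Cray MPT this count is MPICH_GNI_NUM_BUFS (assumed, as above). A minimal sketch of a modest increase:

  export MPICH_GNI_NUM_BUFS=128   # default 64; 128 buffers x 32K = 4M per rank
  aprun -n 512 ./app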
• Default value is 1024.
• Controls the threshold at which the GNI netmod switches from using FMA for RDMA read/write operations to using the BTE.
• Since the BTE is managed in the kernel, BTE-initiated RDMA requests can progress even if the application isn't in MPI, possibly giving slightly better chances of getting some overlap of communication with computation.
• Owing to Opteron/HT quirks, the BTE is often better for moving data to/from memories that are farther from the Gemini.
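The Cray MPT name for this threshold is MPICH_GNI_RDMA_THRESHOLD (assumed here); transfers at or above it use the BTE. A minimal sketch of lowering it so more transfers take the kernel-progressed BTE path:

  export MPICH_GNI_RDMA_THRESHOLD=512   # default 1024 bytes; illustrative value
  aprun -n 512 ./app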
• Default value is 8192 bytes.
• Specifies the threshold at which the shared memory channel switches from a double-copy protocol to a single-copy (XPMEM) protocol for intra-node messages.
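This crossover is MPICH_SMP_SINGLE_COPY_SIZE in Cray MPT (assumed, as above). A minimal sketch of letting smaller intra-node messages use the single-copy path:

  export MPICH_SMP_SINGLE_COPY_SIZE=4096   # default 8192 bytes; illustrative value
  aprun -n 24 ./app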
• Specifies whether or not to use an XPMEM-based single-copy protocol for intra-node messages of size MPICH_SMP_SINGLE_COPY_SIZE bytes or larger.
• Starting with MPT 5.0.2, the default is that single copy via XPMEM is enabled (='0'). In older versions single copy via XPMEM is disabled (='1').
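The '0'/'1' values match Cray MPT's MPICH_SMP_SINGLE_COPY_OFF (assumed; the name is not captured in the transcript). A minimal sketch of forcing the newer default on an older MPT:

  export MPICH_SMP_SINGLE_COPY_OFF=0   # 0 = XPMEM single copy enabled, 1 = disabled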
• This allows for more asynchronous message transfer.
• But the additional copy on the receiving side may offset the gain.
• This is important enough that we are mentioning it twice!
Memory Allocation: Make it local
Linux has a "first touch policy" for memory allocation:
• *alloc functions don't actually allocate your memory.
• Memory gets allocated when "touched".
Problem: A code can allocate more memory than is available.
• Linux assumes "swap space"; we don't have any.
• Applications won't fail from over-allocation until the memory is finally touched.
Problem: Memory will be put on the core of the "touching" thread.
• Only a problem if thread 0 allocates all memory for a node.
Solution: Always initialize your memory immediately after allocating it.
• If you over-allocate, it will fail immediately, rather than at a strange place in your code.
• If every thread touches its own memory, it will be allocated on the proper socket/die.
Is your nearest neighbor really your nearest neighbor? And do you want them to be your nearest neighbor?
The default ordering can be changed using the following environment variable: MPICH_RANK_REORDER_METHOD.
These are the different values that you can set it to:
• 0: Round-robin placement – Sequential ranks are placed on the next node in the list. Placement starts over with the first node upon reaching the end of the list.
• 1: (DEFAULT) SMP-style placement – Sequential ranks fill up each node before moving to the next.
• 2: Folded rank placement – Similar to round-robin placement except that each pass over the node list is in the opposite direction of the previous pass.
• 3: Custom ordering – The ordering is specified in a file named MPICH_RANK_ORDER.
When is this useful?
• Point-to-point communication consumes a significant fraction of program time and a load imbalance is detected.
• Also shown to help for collectives (alltoall) on subcommunicators.
• To spread out IO across nodes.
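A minimal sketch of trying round-robin placement (executable name hypothetical):

  export MPICH_RANK_REORDER_METHOD=0   # 1 (SMP-style) is the default
  aprun -n 1024 ./app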
GYRO 8.0
• B3-GTC problem with 1024 processes
• Run with alternate MPI orderings
Reorder method       Comm. time
1 – SMP (default)    11.26 s
0 – round-robin       6.94 s
2 – folded-rank       6.68 s
Note:
• The rank reordering only works across nodes. If you want to pack within a node in a special way, use aprun -cc 'cpu_list'.
• Hence, to get a bit more out of the folded-rank option, use aprun -cc 0,6,12,18,19,13,7,1,2,8,14,20,21,15,9,3,4,10,16,22,23,17,11,5. This folds across nodes, and folds within dies on a node.
TGYRO 1.0
• Steady-state turbulent transport code using GYRO, NEO, TGLF components
• ASTRA test case
• Tested MPI orderings at large scale
• Originally testing weak scaling, but found reordering very useful
Reorder method   TGYRO wall time (min)
                 20480 cores   40960 cores   81920 cores
Default              99            104           105
Round-robin          66             63            72     <- Huge win!
• Application data is in a 3D space, X*Y*Z.
• Communication is nearest-neighbor (halo exchange).
• The default ordering results in a 12x1x1 block on each node (Istanbul example, 12 cores).
• A custom reordering is now generated: 3x2x2 blocks per node, resulting in more on-node communication.
• Note: Using blocks or slabs within a node may help some communications. For a 6x4 chunk, you could try four 6x1 chunks, or four 3x2 chunks, with each die getting one chunk.
X X o o
X X o o
o o o o
o o o o
• Nodes marked X heavily use a shared resource.
• If the shared resource is:
  • memory bandwidth: scatter the X's;
  • network bandwidth to other nodes: again, scatter;
  • network bandwidth among themselves: concentrate.
Check out pat_report, grid_order, and mgrid_order (sketched below) for generating custom rank orders based on:
• Measured data
• Communication patterns
• Data decomposition
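A minimal sketch of generating a custom order with grid_order for the 3x2x2-blocks example above (the process-grid dimensions are illustrative; check the grid_order man page on your system for the exact options):

  grid_order -R -c 3,2,2 -g 24,12,12 > MPICH_RANK_ORDER   # 3x2x2 cells over an assumed 24x12x12 rank grid
  export MPICH_RANK_REORDER_METHOD=3                      # method 3 reads ./MPICH_RANK_ORDER
  aprun -n 3456 ./app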
Gemini loves to use HUGE pages
• The Gemini performs better with HUGE pages than with 4K pages.
• HUGE pages use fewer Gemini resources than 4K pages (fewer entries to map the same bytes).
• Your code may run with fewer TLB misses (hence faster).
• Link the hugetlbfs library into your code: '-lhugetlbfs'.
• Set the HUGETLB_MORECORE environment variable in your run script.
  Example: export HUGETLB_MORECORE=yes
• Use the aprun option -m###h to ask for ### Meg of HUGE pages.
  Example: aprun -m500h (request 500 Megs of HUGE pages as available; use 4K pages thereafter)
  Example: aprun -m500hs (request 500 Megs of HUGE pages; if not available, terminate the launch)
• Note: If not enough HUGE pages are available, the cost of filling the remainder with 4K pages may degrade performance.
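Putting the steps above together, a minimal sketch (source file, page-pool size, and rank count are illustrative):

  cc -o app app.c -lhugetlbfs    # link the hugetlbfs library
  export HUGETLB_MORECORE=yes    # heap growth now comes from huge pages
  aprun -m500h -n 24 ./app       # 500M of huge pages, falling back to 4K pages ('hs' would abort instead)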
But isn't that a system call?
Trick question: No, but it's expensive and should be done as infrequently as possible!
GNU malloc library: malloc, calloc, realloc, free calls
• Fortran dynamic variables
Malloc library system calls:
• mmap, munmap => for larger allocations
• brk, sbrk => increase/decrease the heap
Malloc library is optimized for low system memory use:
• Can result in system calls/minor page faults
Detecting "bad" malloc behavior:
• Profile data => "excessive system time"
Correcting "bad" malloc behavior:
• Eliminate mmap use by malloc.
• Increase the threshold to release heap memory.
Use environment variables to alter malloc (see the sketch below):
• MALLOC_MMAP_MAX_=0
• MALLOC_TRIM_THRESHOLD_=536870912 (or an appropriate size; the heap is only trimmed when this amount in total is freed)
Possible downsides:
• Heap fragmentation
• User process may call mmap directly
• User process may launch other processes
PGI's -Msmartalloc does something similar for you at compile time.
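A minimal sketch of applying these glibc malloc tunables in a run script (values from the slide; executable hypothetical):

  export MALLOC_MMAP_MAX_=0                 # never satisfy malloc from mmap
  export MALLOC_TRIM_THRESHOLD_=536870912   # keep up to 512 MB of freed heap rather than returning it
  aprun -n 512 ./app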
Option  Description
-D      Debug (shows the layout aprun will use)
-n      Number of MPI tasks. Note: If you do not specify the number of tasks to aprun, the system will default to 1.
-N      Number of tasks per node
-m      Memory required per task
-d      Number of threads per MPI task. Note: If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. You must use OMP_NUM_THREADS to specify the number of threads per MPI task, and you must use -d to tell aprun how to place those threads.
-S      Number of PEs to allocate per NUMA node
-ss     Strict memory containment per NUMA node
To run using 1376 MPI tasks with 4 threads per MPI task:
  export OMP_NUM_THREADS=4
  aprun -ss -N4 -d4 -n1376 ./xhpl_mp
To run without threading:
  export OMP_NUM_THREADS=1
  aprun -ss -N16 -n5504 ./xhpl_mp
Thanks to the following people for contributing expertise and slides to this presentation:
• Jeff Larkin, Cray ORNL Center of Excellence
• Cray Performance Team
• Cray MPT Group
• Cray Programming Environment Group