A Slurm Simulator: Implementation and Parametric Analysis
Nikolay A. Simakov, Martins D. Innus, Matthew D. Jones, Robert L. DeLeon, Joseph P. White, Steven M. Gallo, Abani K. Patra and Thomas R. Furlani
November 13, 2017, PMBS17 at SC17
Center for Computational Research, University at Buffalo
• Regional HPC center serving academic and industry users from western NY
• 8,000-core academic cluster, 3,456-core industry cluster
• 500 active users and 200 PIs
• 106 million core hours delivered during the 2016-2017 academic year
A Tool for HPC System Management
• XDMoD: XD Metrics on Demand
  • HPC resource usage and performance monitoring and analysis
  • On-demand, responsive access to job accounting data
  • NSF-funded analytics framework developed for XSEDE
  • http://xdmod.ccr.buffalo.edu/
• Comprehensive framework for HPC management
  • Support for several resource managers (Slurm, PBS, LSF, SGE)
  • Utilization metrics across multiple dimensions
  • Measures QoS of HPC infrastructure (App Kernels)
  • Job-level performance data
• Open XDMoD: open-source version for HPC centers
  • 100+ academic and industrial installations worldwide
  • http://open.xdmod.org/
• Used for the Blue Waters workload analysis
• Currently carrying out an XSEDE workload analysis
Center for Computational Research, University at Buffalo
Effect of Node Sharing
• Under node sharing, several jobs are allowed to execute on the same node.
• Jobs have dedicated cores.
[Figure: a two-socket compute node (CPU0 and CPU1, each with 4x DDR4 RAM, plus local storage and network) shared by Jobs A, B and C]
• There are computational tasks which cannot efficiently use all of the cores available on a node:
  • serial applications
  • poorly scalable parallel software
  • small problem sizes
  • time-imbalanced embarrassingly parallel tasks such as parameter sweeps
But how does it affect application performance?
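In Slurm terms, this style of node sharing (co-located jobs, each with its own dedicated cores) is what the consumable-resources selector provides. A minimal, illustrative slurm.conf fragment (not the authors' actual configuration):

```
# Treat cores and memory as consumable resources, so several jobs
# can run on the same node while each keeps dedicated cores.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
```

With this selector, jobs share nodes by default unless submitted with `--exclusive`.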
Effect of Node Sharing
Simakov et al., 2016. Proceedings of XSEDE16, 1–8.
• The effect of node sharing on application performance is small.
[Figure: mean wall time percent difference under node sharing (Sharing > 0 and Sharing > 0.75) for ENZO, GAMESS, NWChem, NAMD, Graph500, HPCC and IOR; left panel: single-core jobs (roughly -2% to 12%), right panel: single-socket jobs (roughly 0% to 1.5%)]
What is the overall effect on the whole system?
Will it actually increase throughput?
To have a quantitative answer we need a workflow simulator!
Slurm Workload Manager
• Slurm is an open-source resource manager for HPC
• It provides high configurability for inhomogeneous resources and job scheduling
• It is used on a large range of HPC resources, from small to very large systems
• Which configuration is best for particular needs?
[Figure: a heterogeneous cluster — compute nodes of generations i, i+1 and i+2; a big-memory node (gen. i); very-big-memory nodes (gen. i+1); GPU nodes (gen. i+1)]
Slurm Simulator
• Why do we need a Slurm simulator?
  • To check a Slurm configuration prior to deployment
  • To find optimal parameters for Slurm
  • To model future systems
• Workflow simulators:
  • Bricks, SimGrid, Simbatch, GridSim and Alea, Maui and Moab scheduler
• Slurm simulators:
  • Original version developed by Alejandro Lucero
  • Later improved by Trofinoff and Benini
  • Works only for very small systems, and the Slurm version is outdated
Making a Slurm Simulator from Slurm
• Started from the latest stable Slurm release
• Minimized the number of processes
• The Slurm controller performs the simulation
• Serialized the Slurm controller
[Figure: Slurm architecture — user commands (squeue, sbatch, etc.), the Slurm controller, SlurmDB, a Slurm daemon running on each managed resource, and the job trace file]
[Figure: controller activity timeline — main scheduler (MS) passes separated by 60 s sleeps, backfill scheduler (BF) passes separated by 120 s sleeps]
Serialization of the Slurm Controller
• Serialized Slurm controller
  • In real Slurm, scheduling is effectively serial due to thread locks
  • Due to the absence of thread locks, compilation with optimization flags on, and assert functions off, the backfill scheduler is about 10 times faster in simulation mode
• Simulator main loop:
  • Main priority-based scheduler
  • Backfill scheduler
  • Submit new jobs
  • Terminate running jobs which exceeded their planned wall time
  • SlurmDB synchronization
  • Increment time if applicable
• Simulated time (t_sim) vs. real time (t_real):
  • Simulated time is scaled real time with time stepping
  • Shifted real time in most places
  • Shifted and scaled real time in the backfill scheduler
  • Time increment (30-60 seconds) in case of no events
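The serialized main loop above can be sketched in Python. This is an illustration of the loop structure only, not the actual C implementation; all names are invented, and the scheduler step is trivialized (every pending job starts immediately, whereas real Slurm checks resources and priorities):

```python
# Minimal sketch of a serialized simulator main loop:
# submit, terminate and schedule run serially in one loop,
# and simulated time advances in fixed steps.

def run_simulation(jobs, step=60):
    """jobs: list of dicts with 'submit_time' and 'wall_time' (seconds),
    sorted by submit_time. Returns finished jobs with start/end times."""
    t = 0                                   # simulated time, seconds
    pending, running, finished = [], [], []
    while jobs or pending or running:
        # 1. Submit jobs whose submit time has arrived
        while jobs and jobs[0]["submit_time"] <= t:
            pending.append(jobs.pop(0))
        # 2. Terminate running jobs that exceeded their planned wall time
        for job in running[:]:
            if t >= job["end_time"]:
                running.remove(job)
                finished.append(job)
        # 3. Scheduling pass (main + backfill, trivially merged here)
        for job in pending[:]:
            pending.remove(job)
            job["start_time"] = t
            job["end_time"] = t + job["wall_time"]
            running.append(job)
        # 4. Increment time if applicable (no events => jump ahead)
        t += step
    return finished
```

For example, two jobs submitted at t=0 and t=60 start at those times and the loop exits once both have terminated.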
Simulating Workflows with the Slurm Simulator
• RSlurmSimTools – R library for input generation and results analysis
• slurm_sim_tools – multiple convenience scripts
[Figure: simulation workflow — historic jobs (sacct output) and users and accounts (sacctmgr output or SQL dump) are converted by RSlurmSimTools into a job trace, users list, simulation parameters and Slurm parameters; run_sim.py, a Python convenience start-up script, drives the Slurm controller daemon and the Slurm database daemon (MySQL); the resulting job execution log is analyzed with RSlurmSimTools]
Slurm Simulator: Implementation Details
• For better performance, the number of processes and threads was decreased
• Communication with slurmd is mimicked within slurmctld
• slurmctld was serialized; functions are executed serially from the main simulation event loop
• Slurm is compiled with optimization flags on and assert functions off
• Time is scaled within the backfill scheduler, usually by a factor of 10, to reflect the performance difference between real Slurm and the Slurm simulator
• At the end of the main simulator event loop, time is incremented by 1 second if no event happened during the loop
• R scripts are used to generate job trace files and to analyze results
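The two clock mappings described above (shifted real time in most of the controller, shifted and scaled real time inside the backfill scheduler) can be illustrated with a small sketch. The factor of 10 matches the speed-up quoted on the slides; the constants and function names are invented for illustration:

```python
# Illustrative sketch of the simulator's two clock mappings.
SHIFT = 1_000_000   # offset from the real clock to the simulated clock, s
SCALE = 10.0        # backfill speed-up factor (simulator vs. real Slurm)

def sim_time(real_time):
    """Shifted real time, used in most of the controller."""
    return real_time + SHIFT

def sim_time_backfill(real_start, real_now):
    """Shifted and scaled real time inside the backfill scheduler:
    each real second of scheduler work counts as SCALE simulated
    seconds, mimicking the slower backfill pass of real Slurm."""
    return sim_time(real_start) + SCALE * (real_now - real_start)
```

So a backfill pass that takes 2 real seconds covers 20 simulated seconds of scheduler time.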
Validating the Simulator Using Micro-Cluster, a Small Model System
• Micro-Cluster is a small model cluster created for Slurm simulator validation
• Reference data was obtained by running regular Slurm in front-end mode
• The Micro-Cluster configuration was chosen to test constraints, GRES, and cores and memory as consumable resources
• The workload consisted of 500 jobs and takes 12.9 hours to complete
• 5 users belonging to 2 accounts
| Node Type | Number of Nodes | Cores per Node | CPU Type | RAM |
|---|---|---|---|---|
| Compute | 4 | 12 | CPU-N | 48 GB |
| Compute | 4 | 12 | CPU-M | 48 GB |
| High Memory | 1 | 12 | CPU-G | 512 GB |
| GPU Compute | 1 | 12 | CPU-G | 48 GB |

[Figure: the 5 users organized into Account 1 and Account 2]
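A layout like the one in the table could be expressed with slurm.conf node definitions along these lines (a hedged sketch: node names and feature tags are assumptions, not the authors' actual configuration; `RealMemory` is in MB and `Gres` advertises the GPU):

```
# Hypothetical node definitions matching the Micro-Cluster table
NodeName=n[1-4]  CPUs=12 RealMemory=48000  Feature=CPU-N
NodeName=m[1-4]  CPUs=12 RealMemory=48000  Feature=CPU-M
NodeName=b1      CPUs=12 RealMemory=512000 Feature=CPU-G
NodeName=g1      CPUs=12 RealMemory=48000  Feature=CPU-G Gres=gpu:1
```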
Micro-Cluster: Slurm Scheduling is not Unique
• Simulation has variability similar to real Slurm
[Figure: distributions of job start time differences, both between simulated and real Slurm runs and between two real Slurm runs]
Variability Origin
[Figure: two scheduler timelines (main scheduler MS passes with 60 s sleeps, backfill BF passes with 120 s sleeps) illustrating how small timing shifts change outcomes. Jobs J and K are submitted while Job I runs, and Job J has lower priority than Job K. Depending on whether Job I finishes just before or just after a scheduling pass, either Job K or Job J is placed first.]
Micro-Cluster: Comparison of Utilization and Job Priorities
• Resource utilization and job priority changes in simulation are similar to real Slurm
[Figure: changing of job priority factor over time]
Micro-Cluster: Modifying the Fair-Share Priority Factor Weight
• The fair-share priority factor weight was increased by 20%
• User 1 should see the biggest decrease in wait time
• Real wait times are within the range of predicted values
[Figure: wait times for users in Account 1 and Account 2]
Studying the UB-HPC Cluster

| Node Type | Number of Nodes | Cores per Node | CPU Type | RAM |
|---|---|---|---|---|
| Compute | 32 | 16 | Intel E5-2660 | 128 GB |
| Compute | 372 | 12 | Intel E5645 | 48 GB |
| Compute | 128 | 8 | Intel L5630 | 24 GB |
| Compute | 128 | 8 | Intel L5520 | 24 GB |
| High Memory | 8 | 32 | Intel E7-4830 | 256 GB |
| High Memory | 8 | 32 | AMD 6132HE | 256 GB |
| High Memory | 2 | 32 | Intel E7-4830 | 512 GB |
| GPU Compute | 26 | 12 | Intel X5650 | 48 GB |

A historic workload of 24 days was used (October 4, 2016 to October 28, 2016).
Studying the UB-HPC Cluster: Simulation vs. Historic Data
• The simulation did not have the initial historic usage, so the initial fair-share priorities were incorrect
Studying the UB-HPC Cluster: Simulation vs. Historic Data
• The simulation misses the influence of excessive RPC calls and the performance hit from the multiple threads started for job start-up and finalization
Reducing bf_max_job_user
• bf_max_job_user specifies the maximal number of a user's jobs considered by the backfill scheduler
• bf_max_job_user was reduced from 20 to 10
• 40-minute (0.1%) increase in time to complete the workload
• The mean wait time is 8 minutes longer, and the standard deviation of the wait time differences is 3 hours
• 25% decrease in the number of jobs considered for scheduling
• 30% decrease in backfill scheduler run time
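The change studied here is a one-line slurm.conf setting: bf_max_job_user is one of the comma-separated options in SchedulerParameters used by the backfill plugin.

```
# Backfill scheduling with at most 10 jobs per user considered per pass
# (reduced from the previous value of 20)
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10
```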
UB-HPC Cluster: Node Sharing
• The exclusive mode takes 10.8 more days (45% more time) to complete the same workload
• The average increase in waiting time is 5.1 days, with a standard deviation of 6.6 days
• The 45% increase in time to complete the same load can be translated into the need for a 45% larger cluster to serve the same workload
Studying Stampede 2
• Node sharing on Skylake-X nodes
  • Sharing by sockets or by cores
• Separate controllers for KNL and Skylake-X nodes

| Node Type | Number of Nodes | Cores per Node | CPU Type | RAM |
|---|---|---|---|---|
| Intel Knights Landing | 6400 | 68 | Intel Xeon Phi 7250 | 96 GB + 16 GB |
| Intel Xeon Skylake-X | 1736 | 48 | Intel Xeon Platinum 8160 | 192 GB |

• A 12-week workload was generated from the Stampede 1 historic workload (2015-05-16 to 2015-08-08)
• The number of jobs was scaled proportionally to the node count
• Sub-node jobs were derived using CPU utilization
Stampede 2: Node Sharing and Separate Slurm Controllers
• Node sharing and separate controllers cut waiting times nearly in half

Jobs on SKX nodes:

| Controller | Node Sharing on SKX Nodes | Wait Hours, Mean | Wait Hours, Mean Weighted by Node Hours |
|---|---|---|---|
| Single | no sharing | 10.9 (0%) | 17.0 (0%) |
| Single | sharing by sockets | 8.2 (-25%) | 15.5 (-9%) |
| Single | sharing by cores | 8.2 (-24%) | 15.5 (-9%) |
| Separate | no sharing | 7.1 (-35%) | 15.0 (-12%) |
| Separate | sharing by sockets | 5.3 (-51%) | 13.8 (-19%) |
| Separate | sharing by cores | 5.5 (-49%) | 13.9 (-18%) |

Jobs on KNL nodes:

| Controller | Node Sharing on SKX Nodes | Wait Hours, Mean | Wait Hours, Mean Weighted by Node Hours |
|---|---|---|---|
| Single | no sharing | 8.6 (0%) | 9.2 (0%) |
| Single | sharing by sockets | 7.2 (-16%) | 9.2 (-1%) |
| Single | sharing by cores | 7.3 (-15%) | 9.1 (-1%) |
| Separate | no sharing | 8.2 (-4%) | 9.4 (+2%) |
Simulation Speed
The simulator speed depends heavily on the cluster size, workload and Slurm configuration.

| System Name | System Characteristics | Simulation Characteristics | Simulation Speed, Simulated Days per Hour |
|---|---|---|---|
| Micro-Cluster | 120 cores | Node sharing on | 112.0 |
| UB-HPC | 8,000 cores | Exclusive nodes | 0.8 |
| UB-HPC | 8,000 cores | Node sharing on | 5.4 |
| UB-HPC | 8,000 cores | Smaller bf_max_job_user | 17.3 |
| Stampede | 6,400 nodes | | 0.5 |
How to Get the Simulator
Various utilities and documentation are available at https://github.com/nsimakov/slurm_sim_tools
Slurm Simulator code: https://github.com/nsimakov/slurm_simulator
Conclusions
• A new Slurm simulator was developed, capable of simulating a mid-sized cluster at a speed of multiple simulated days per hour
• Its validity was established by comparison with actual Slurm runs, which showed a good match: similar mean values for job start times, with a slightly larger standard deviation
• The simulator can be used to study a number of Slurm parameters that affect system utilization and throughput, such as the fair-share policy, the maximum number of user jobs considered for backfill, and the node sharing policy
• As expected, the fair-share policy alters job priorities and start times, but in a non-trivial fashion
• Decreasing the maximal number of a user's jobs considered by the backfill scheduler from 20 to 10 was found to have a minimal effect on average scheduling and to decrease the backfill scheduler run time by 30%
• The simulation study of node sharing on our cluster showed a 45% increase in the time needed to complete the workload in exclusive mode compared to shared mode
• For a large system (>6,000 nodes) comprised of two distinct sub-clusters, two separate Slurm controllers combined with node sharing can cut waiting times nearly in half
Acknowledgments
• XDMoD Developers Team
  • UB: Tom Furlani, Matt Jones, Steve Gallo, Bob DeLeon, Martins Innus, Jeff Palmer, Ben Plessenger, Ryan Rathsam, Nikolay Simakov, Jeanette Sperhac, Joe White, Tom Yearke, Rudra Chakraborty, Cynthia Cornelius, Abani Patra
  • Indiana: Gregor von Laszewski, Fugang Wang
  • University of Texas: Jim Browne
  • TACC: Bill Barth, Todd Evans, Weijia Xu
  • NCAR: Shiquan Su
• Funding:
  • National Science Foundation under awards OCI 1025159, 1203560, ACI 1445806
Questions?
• Visit our booth: University at Buffalo / Center for Computational Research, booth 1867
• Visit the XDMoD Birds-of-a-Feather (BOF) session:
  • Tracking and Analyzing Job-level Activity Using Open XDMoD, XALT and OGRT
  • Tuesday, November 14th, 5:15 pm - 7 pm
  • Room: 205-207
Various utilities and documentation are available at https://github.com/nsimakov/slurm_sim_tools
Slurm Simulator code: https://github.com/nsimakov/slurm_simulator