A Slurm Simulator: Implementation and Parametric Analysis. Nikolay A. Simakov, Martins D. Innus, Matthew D. Jones, Robert L. DeLeon, Joseph P. White, Steven M. Gallo, Abani K. Patra and Thomas R. Furlani. November 13, 2017, PMBS17 at SC17


Page 1: A Slurm Simulator: Implementation and Parametric Analysis

A Slurm Simulator: Implementation and Parametric Analysis

Nikolay A. Simakov, Martins D. Innus, Matthew D. Jones, Robert L. DeLeon, Joseph P. White, Steven M. Gallo, Abani K. Patra and Thomas R. Furlani

November 13, 2017, PMBS17 at SC17

Page 2: A Slurm Simulator: Implementation and Parametric Analysis

Center for Computational Research, University at Buffalo
• Regional HPC Center
• Serving academic and industry users from western NY
• 8,000-core academic cluster, 3,456-core industry cluster
• 500 active users and 200 PIs
• 106 million core hours delivered during the 2016-2017 academic year

Page 3: A Slurm Simulator: Implementation and Parametric Analysis

A Tool for HPC System Management
• XDMoD: XD Metrics on Demand
  • HPC resource usage and performance monitoring and analysis
  • On-demand, responsive access to job accounting data
  • NSF-funded analytics framework developed for XSEDE
  • http://xdmod.ccr.buffalo.edu/
• Comprehensive framework for HPC management
  • Support for several resource managers (Slurm, PBS, LSF, SGE)
  • Utilization metrics across multiple dimensions
  • Measures QoS of HPC infrastructure (Application Kernels)
  • Job-level performance data
• Open XDMoD*: open-source version for HPC centers
  • 100+ academic & industrial installations worldwide
  • http://open.xdmod.org/
• Utilized for Blue Waters workload analysis
• Currently carrying out XSEDE workload analysis

Page 4: A Slurm Simulator: Implementation and Parametric Analysis

Center for Computational Research, University at Buffalo

Page 5: A Slurm Simulator: Implementation and Parametric Analysis

Effect of Node Sharing
• Under node sharing, several jobs are allowed to execute on the same node.
• Jobs have dedicated cores.

[Figure: a dual-socket node (CPU0, CPU1, 4x DDR4 RAM per socket, local storage, network) shared by Jobs A, B and C]

• There are computational tasks which cannot efficiently use all of the cores available on a node:
  • serial applications
  • poorly scalable parallel software
  • small problem sizes
  • time-imbalanced embarrassingly parallel tasks such as parameter sweeps

But how does it affect the application performance?

Page 6: A Slurm Simulator: Implementation and Parametric Analysis

Effect of Node Sharing

Simakov et al., 2016. Proceedings of the XSEDE16. 1–8.

• The effect of node sharing on application performance is small.

[Figure: mean wall time percent difference for ENZO, GAMESS, NWChem, NAMD, Graph500, HPCC and IOR under node sharing (Sharing > 0 and Sharing > 0.75), for single-core and single-socket jobs]

What is the overall effect on the whole system? Will it actually increase throughput?

To have a quantitative answer we need a workflow simulator!

Page 7: A Slurm Simulator: Implementation and Parametric Analysis

Slurm Workload Manager
• Slurm is an open-source resource manager for HPC
• It provides high configurability for inhomogeneous resources and job scheduling
• It is used on a large range of HPC resources, from small to very large systems.
• Which configuration is best for particular needs?

[Figure: heterogeneous cluster with compute nodes of generations i, i+1 and i+2, a big-memory node (gen. i), very-big-memory nodes (gen. i+1) and GPU nodes (gen. i+1)]

Page 8: A Slurm Simulator: Implementation and Parametric Analysis

Slurm Simulator
• Why do we need a Slurm simulator?
  • To check a Slurm configuration prior to deployment
  • To find optimal parameters for Slurm
  • To model future systems
• Workflow simulators:
  • Bricks, SimGrid, Simbatch, GridSim and Alea, Maui and Moab Scheduler
• Slurm simulators:
  • Original version developed by Alejandro Lucero
  • Later improved by Trofinoff and Benini
  • Works only for very small systems and the Slurm version is outdated.

Page 9: A Slurm Simulator: Implementation and Parametric Analysis

Making a Slurm Simulator from Slurm
• Started from the latest stable Slurm release
• Minimized the number of processes
• The Slurm controller performs the simulation
• Serialized the Slurm controller

[Diagram: user commands (squeue, sbatch, etc.), Slurm controller, SlurmDB, Slurm daemons running on each managed resource, and a job trace file]

[Timeline: the main scheduler (MS) runs and sleeps 60 s; the backfill scheduler (BF) runs and sleeps 120 s]

Page 10: A Slurm Simulator: Implementation and Parametric Analysis

Serialization of Slurm Controller

• Serialized Slurm controller
  • In real Slurm, scheduling is effectively serial due to thread locks
  • Due to the lack of thread locks, compilation with optimization flags on, and assert functions turned off, the backfill scheduler is about 10 times faster in simulation mode

Simulator main loop:
• Main priority-based scheduler
• Backfill scheduler
• Submit new jobs
• Terminate running jobs which exceeded their planned wall time
• SlurmDB synchronization
• Increment time if applicable

• Simulated time: scaled real time with time stepping
  • Shifted real time in most places
  • Shifted and scaled real time in the backfill scheduler
  • Time increment (30-60 seconds) in case of no events
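The serialized event loop above can be sketched as follows. This is a minimal illustration only: the real simulator runs these steps inside a modified slurmctld in C, and all names here are hypothetical.

```python
# Sketch of a serialized simulator event loop: handlers fire one at a time
# (no thread locks), and simulated time jumps ahead when nothing is pending.

def run_simulation(events, sim_start, no_event_step=30):
    """Advance simulated time, firing event handlers serially.

    events: list of {"time": int, "kind": str} (job submits, terminations, ...)
    no_event_step: idle time increment (the talk uses 30-60 seconds)
    """
    t_sim = sim_start
    log = []
    pending = sorted(events, key=lambda e: e["time"])
    while pending:
        fired = [e for e in pending if e["time"] <= t_sim]
        if fired:
            # in the real simulator: main scheduler, backfill, submits,
            # terminations and SlurmDB sync are called here, serially
            for e in fired:
                log.append((t_sim, e["kind"]))
            pending = [e for e in pending if e["time"] > t_sim]
            t_sim += 1  # small step after handling events
        else:
            # no events at the current time: jump, but never past the next event
            t_sim = min(t_sim + no_event_step, pending[0]["time"])
    return log
```

Because nothing blocks on locks or real sleeps, simulated time can advance much faster than wall-clock time, which is where the speed-up over real Slurm comes from.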

Page 11: A Slurm Simulator: Implementation and Parametric Analysis

Simulating Workflows with Slurm Simulator

• RSlurmSimTools: R library for input generation and results analysis
• slurm_sim_tools: multiple convenience scripts

[Diagram: a job trace, users list, simulation parameters and Slurm parameters feed the Slurm controller daemon and the Slurm database daemon (MySQL); historic jobs (sacct output) and users/accounts (sacctmgr output or SQL dump) are converted by RSlurmSimTools for input generation; run_sim.py is a Python convenience start-up script; the resulting job execution log is analyzed with RSlurmSimTools]

Page 12: A Slurm Simulator: Implementation and Parametric Analysis

Slurm Simulator: Implementation Details

• For better performance, the number of processes and threads was decreased
• Communication with slurmd is mimicked within slurmctld
• slurmctld was serialized; functions are executed serially from the main simulation event loop
• Slurm is compiled with optimization flags on and with assert functions off
• Time is scaled within the backfill scheduler, usually by a factor of 10, to reflect the performance difference between real Slurm and the Slurm simulator
• At the end of the main simulator event loop, time is incremented by 1 second if no event happened during the loop
• R scripts are used to generate job trace files
• R scripts are used to analyze results


Page 13: A Slurm Simulator: Implementation and Parametric Analysis

Validating the Simulator using Micro-Cluster, a Small Model System

• Micro-Cluster is a small model cluster created for Slurm simulator validation
• Reference data was obtained by running regular Slurm in front-end mode
• The Micro-Cluster configuration was chosen to test constraints, GRES, and cores and memory as consumable resources
• The workload consists of 500 jobs and takes 12.9 hours to complete
• 5 users belonging to 2 accounts

Node Type   | Number of Nodes | Cores per Node | CPU Type | RAM
Compute     | 4               | 12             | CPU-N    | 48 GB
Compute     | 4               | 12             | CPU-M    | 48 GB
High Memory | 1               | 12             | CPU-G    | 512 GB
GPU Compute | 1               | 12             | CPU-G    | 48 GB

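Node definitions like those in the table above might look roughly like this in slurm.conf. This is an illustrative sketch only: the hostnames, memory values, feature names and GRES counts are assumptions, not the actual validation configuration.

```
# Hypothetical slurm.conf node lines matching the Micro-Cluster table.
# Features and Gres let the validation workload exercise constraints
# and GRES scheduling, as the slide describes.
NodeName=n[1-4] CPUs=12 RealMemory=48000  Feature=CPU-N
NodeName=m[1-4] CPUs=12 RealMemory=48000  Feature=CPU-M
NodeName=b1     CPUs=12 RealMemory=512000 Feature=CPU-G
NodeName=g1     CPUs=12 RealMemory=48000  Feature=CPU-G Gres=gpu:2
```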

Page 14: A Slurm Simulator: Implementation and Parametric Analysis

Micro-Cluster: Slurm Scheduling is not Unique

• The simulation has variability similar to real Slurm

[Figure: job start time differences, between simulated and real Slurm runs and between two real Slurm runs]

Page 15: A Slurm Simulator: Implementation and Parametric Analysis

Variability Origin

[Timelines: Job I is submitted and placed; while it runs, Jobs J and K are submitted, with Job J having lower priority than Job K. Depending on where the main scheduler (MS) and backfill (BF) passes fall relative to Job I finishing, either Job K or Job J is placed first, so scheduling outcomes vary from run to run]

Page 16: A Slurm Simulator: Implementation and Parametric Analysis

Micro-Cluster: Comparison of Utilization and Job Priorities

• Resource utilization and job priority changes in simulation are similar to real Slurm

[Figure: change of job priority factors over time]

Page 17: A Slurm Simulator: Implementation and Parametric Analysis

Micro-Cluster: Modifying the Fair-Share Priority Factor Weight

• The fair-share priority factor weight was increased by 20%
• User 1 should see the biggest decrease in wait time
• Real wait times are within the range of predicted values

[Figure: wait times for users in Account 1 and Account 2]

Page 18: A Slurm Simulator: Implementation and Parametric Analysis

Studying UB-HPC Cluster

Node Type   | Number of Nodes | Cores per Node | CPU Type      | RAM
Compute     | 32              | 16             | Intel E5-2660 | 128 GB
Compute     | 372             | 12             | Intel E5645   | 48 GB
Compute     | 128             | 8              | Intel L5630   | 24 GB
Compute     | 128             | 8              | Intel L5520   | 24 GB
High Memory | 8               | 32             | Intel E7-4830 | 256 GB
High Memory | 8               | 32             | AMD 6132HE    | 256 GB
High Memory | 2               | 32             | Intel E7-4830 | 512 GB
GPU Compute | 26              | 12             | Intel X5650   | 48 GB

A historic workload covering 24 days was used (October 4, 2016 to October 28, 2016).

Page 19: A Slurm Simulator: Implementation and Parametric Analysis

Studying UB-HPC Cluster: Simulation vs Historic Data

• The simulation did not have the initial historic usage, therefore the initial fair-share priorities were incorrect

Page 20: A Slurm Simulator: Implementation and Parametric Analysis

Studying UB-HPC Cluster: Simulation vs Historic Data

• The simulation misses the influence of excessive RPC calls and the performance hit from the multiple threads started for job start-up and finalization

Page 21: A Slurm Simulator: Implementation and Parametric Analysis

Reducing bf_max_job_user

• bf_max_job_user specifies the maximal number of a user's jobs considered by the backfill scheduler for scheduling
• bf_max_job_user was reduced from 20 to 10
• 0.1% (40 minutes) increase in time to complete the workload
• The mean wait time is 8 minutes longer, and the standard deviation of the wait time differences is 3 hours.
• 25% decrease in the number of jobs considered for scheduling
• 30% decrease in backfill scheduler run time.
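In slurm.conf, this knob lives in SchedulerParameters; the change tested above corresponds to a fragment like the following (illustrative, with all other site parameters omitted):

```
# bf_max_job_user caps how many of each user's jobs the backfill
# scheduler examines per pass; reducing it from 20 to 10 as in the
# experiment above shortens backfill passes at a small cost in wait time.
SchedulerParameters=bf_max_job_user=10
```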

Page 22: A Slurm Simulator: Implementation and Parametric Analysis

UB-HPC Cluster: Node Sharing

• The exclusive mode takes 10.8 more days (45% more time) to complete the same workload
• The average increase in waiting time is 5.1 days with a standard deviation of 6.6 days.
• The 45% increase in time to complete the same load can be translated into the need to have a 45% larger cluster to serve the same workload.
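In Slurm, node sharing of the kind compared above is typically enabled by treating cores and memory as consumable resources. A minimal sketch (the partition line is hypothetical and not from the UB-HPC configuration):

```
# Consumable-resource selection lets several jobs share a node,
# each with dedicated cores and its own memory allocation.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# The exclusive baseline would instead force whole-node allocation,
# e.g. on a hypothetical partition:
# PartitionName=general Nodes=... OverSubscribe=EXCLUSIVE
```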

Page 23: A Slurm Simulator: Implementation and Parametric Analysis

Studying Stampede 2

• Node sharing on Skylake-X nodes
  • Sharing by sockets or by cores
• Separate controller for KNL and Skylake-X nodes

Node Type             | Number of Nodes | Cores per Node | CPU Type                 | RAM
Intel Knights Landing | 6400            | 68             | Intel Xeon Phi 7250      | 96 GB + 16 GB
Intel Xeon Skylake-X  | 1736            | 48             | Intel Xeon Platinum 8160 | 192 GB

• A 12-week workload was generated from the Stampede 1 historic workload (2015-05-16 to 2015-08-08)
• The number of jobs was scaled proportionally to the node count
• Sub-node jobs were derived using CPU utilization

Page 24: A Slurm Simulator: Implementation and Parametric Analysis

Stampede 2: Node Sharing and Separate Slurm Controllers

• Node sharing and separate controllers cut waiting times nearly in half

Jobs on SKX Nodes:
Controller | Node Sharing on SKX Nodes | Wait Hours, Mean | Wait Hours, Mean Weighted by Node Hours
Single     | no sharing                | 10.9 (0%)        | 17.0 (0%)
Single     | sharing by sockets        | 8.2 (-25%)       | 15.5 (-9%)
Single     | sharing by cores          | 8.2 (-24%)       | 15.5 (-9%)
Separate   | no sharing                | 7.1 (-35%)       | 15.0 (-12%)
Separate   | sharing by sockets        | 5.3 (-51%)       | 13.8 (-19%)
Separate   | sharing by cores          | 5.5 (-49%)       | 13.9 (-18%)

Jobs on KNL Nodes:
Controller | Node Sharing on SKX Nodes | Wait Hours, Mean | Wait Hours, Mean Weighted by Node Hours
Single     | no sharing                | 8.6 (0%)         | 9.2 (0%)
Single     | sharing by sockets        | 7.2 (-16%)       | 9.2 (-1%)
Single     | sharing by cores          | 7.3 (-15%)       | 9.1 (-1%)
Separate   | no sharing                | 8.2 (-4%)        | 9.4 (2%)
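The "mean weighted by node hours" column weights each job's wait time by the node hours it consumed, so large and long jobs dominate the average. A small sketch of that metric (a hypothetical helper, not taken from the paper's analysis scripts):

```python
def node_hour_weighted_mean_wait(jobs):
    """Mean wait time weighted by node hours (nodes * wall hours per job).

    jobs: iterable of dicts with keys 'wait_hours', 'nodes', 'wall_hours'.
    """
    total_weight = sum(j["nodes"] * j["wall_hours"] for j in jobs)
    weighted_sum = sum(j["wait_hours"] * j["nodes"] * j["wall_hours"]
                       for j in jobs)
    return weighted_sum / total_weight
```

This explains why the weighted means in the table improve less under sharing than the plain means: sharing mostly helps small sub-node jobs, which carry little node-hour weight.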

Page 25: A Slurm Simulator: Implementation and Parametric Analysis

Simulation Speed

The simulator speed depends heavily on the cluster size, workload and Slurm configuration.

System Name   | System Characteristics | Simulation Characteristics | Simulation Speed, Simulated Days per Hour
Micro-Cluster | 120 cores              | Node sharing on            | 112.0
UB-HPC        | 8,000 cores            | Exclusive nodes            | 0.8
UB-HPC        | 8,000 cores            | Node sharing on            | 5.4
UB-HPC        | 8,000 cores            | Smaller bf_max_job_user    | 17.3
Stampede      | 6,400 nodes            |                            | 0.5

Page 26: A Slurm Simulator: Implementation and Parametric Analysis

How to Get the Simulator

Various utilities and documentation are available at https://github.com/nsimakov/slurm_sim_tools
Slurm Simulator code: https://github.com/nsimakov/slurm_simulator

Page 27: A Slurm Simulator: Implementation and Parametric Analysis

Conclusions

• A new Slurm simulator was developed, capable of simulating a mid-sized cluster at a speed of multiple simulated days per hour
• Its validity was established by comparison with actual Slurm runs, which showed a good match: similar mean values for job start times, with a slightly larger standard deviation
• The simulator can be used to study a number of Slurm parameters that affect system utilization and throughput, such as the fair-share policy, the maximum number of user jobs considered for backfill, and the node sharing policy
• As expected, the fair-share policy alters job priorities and start times, but in a non-trivial fashion
• Decreasing the maximal number of a user's jobs considered by the backfill scheduler from 20 to 10 was found to have a minimal effect on average scheduling and to decrease the backfill scheduler run time by 30%
• The simulation study of node sharing on our cluster showed a 45% increase in the time needed to complete the workload in exclusive mode compared to shared mode.
• For a large system (>6,000 nodes) comprised of two distinct sub-clusters, two separate Slurm controllers plus node sharing can cut waiting times nearly in half.

Page 28: A Slurm Simulator: Implementation and Parametric Analysis

Acknowledgments

• XDMoD Developers Team
  • UB: Tom Furlani, Matt Jones, Steve Gallo, Bob DeLeon, Martins Innus, Jeff Palmer, Ben Plessenger, Ryan Rathsam, Nikolay Simakov, Jeanette Sperhac, Joe White, Tom Yearke, Rudra Chakraborty, Cynthia Cornelius, Abani Patra
  • Indiana: Gregor von Laszewski, Fugang Wang
  • University of Texas: Jim Browne
  • TACC: Bill Barth, Todd Evans, Weijia Xu
  • NCAR: Shiquan Su
• Funding:
  • National Science Foundation under awards OCI 1025159, 1203560, ACI 1445806.

Page 29: A Slurm Simulator: Implementation and Parametric Analysis

Questions?

• Visit our booth:
  • University at Buffalo / Center for Computational Research, booth 1867
• Visit the XDMoD Birds-of-a-Feather (BOF) session:
  • Tracking and Analyzing Job-level Activity Using Open XDMoD, XALT and OGRT
  • Tuesday, November 14th, 5:15 pm - 7 pm
  • Room: 205-207

Various utilities and documentation are available at https://github.com/nsimakov/slurm_sim_tools
Slurm Simulator code: https://github.com/nsimakov/slurm_simulator