improving hpc applications scheduling with predictions ... · introduction introduction...

29
Improving HPC applications scheduling with predictions based on automatically-collected historical data Carlos Fenoy Garc´ ıa [email protected] September 2014

Upload: others

Post on 10-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Improving HPC applications scheduling with predictionsbased on automatically-collected historical data

Carlos Fenoy Garcı[email protected]

September 2014

Page 2: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Index

1 IntroductionIntroductionMotivationObjectives

2 Proposed SolutionHPerf

3 EvaluationOverhead analysisApplications characterization and workload generation

4 TestsDescriptionWorkload High

5 Conclusions

2 / 27

Page 3: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Introduction Introduction

Introduction

Supercomputers have increased their resource number in last years

⇓Resource management has become critical

to maximise its usage due to resource sharing

The evolution pace of different components is not equal on all the parts:Memory bandwidth evolves slower than processors speed.

Existing resource selection mechanisms only focus on CPUS and Memory.Other limiting resources exist

Interconnect bandwidthMemory bandwidth

3 / 27

Page 4: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Introduction Introduction

Introduction

Previous work exist to solve the memory bandwidth management issue. ThePhD thesis by Francesc Guim introduces the Less Consume selection policy,that considers the memory bandwidth as a new resource.

Less Consume was later validated, porting the policy to a real system andanalysing its behaviour.

However, this port was done on MareNostrum 2, a different arquitecture tothe one currently available in MareNostrum 3.

This policy requires users to specify the amount of memory bandwidthneeded by each job.

4 / 27

Page 5: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Introduction Motivation

Motivation

Users do not necessarily know the resources used by their applications.

The impact of sharing resources on the applications perfomance.

Improvement of monitoring systems enables the collection of multiple metricsper job.

The lack of trusty mechanisms for specifying resources.

5 / 27

Page 6: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Introduction Objectives

Objectives

Port previous work to a new arquitecture (MareNostrum 3).

Enhance an existing resource selection policy with the usage of historical datato predict the resource usage by jobs.

Aware of shared resources (memory bandwidth)Historical data collected by a transparent monitoring system

⇓Data is used in job scheduling

Analyse the behaviour and the benefits of the policy

Compare the policy with other existing policies

6 / 27

Page 7: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Proposed Solution

Index

1 IntroductionIntroductionMotivationObjectives

2 Proposed SolutionHPerf

3 EvaluationOverhead analysisApplications characterization and workload generation

4 TestsDescriptionWorkload High

5 Conclusions

7 / 27

Page 8: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Proposed Solution HPerf

Historical Performance Data (HPerf)

The system proposed predicts the resources usage of an application based onmonitoring information gathered from previous equivalent executions.

This system uses a user-provided tag to identify the kind of job.

Combines existing technologies to improve the scheduling of applications:

Resource selection policy aware of memory bandwidth (Less Consume)Monitoring system able to collect per job information (PerfMiner)Scheduler and resource management (Slurm)

8 / 27

Page 9: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Proposed Solution HPerf

Consumable Resources (ConsRes)

9 / 27

Page 10: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Proposed Solution HPerf

Less Consume (LessC)

9 / 27

Page 11: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Proposed Solution HPerf

Historical Performance data (HPerf)

If at the submission time the tag is found, the average of the resource usagewill be used as the resource requirements for the job.If it is not found, the application will run with exclusive execution to favourthe monitoring.

9 / 27

Page 12: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Evaluation

Index

1 IntroductionIntroductionMotivationObjectives

2 Proposed SolutionHPerf

3 EvaluationOverhead analysisApplications characterization and workload generation

4 TestsDescriptionWorkload High

5 Conclusions

10 / 27

Page 13: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Evaluation

System analysis

In order be able to analyse the system, the following previous work was done:

PerfMiner overhead analysis.

Applications characterization.

Study of the resources used by an application.Study of the time elapsed per iteration for each application.

Workload generation:A workload generator was chosen due to the unfeasibility of using realproduction applications.

11 / 27

Page 14: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Evaluation Overhead analysis

PerfMiner study

Overhead analysis of PerfMiner with two sets of 100 jobs.(CG class D with 64 tasks)

12 / 27

Page 15: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Evaluation Applications characterization and workload generation

Workload generation: applications characterization

Applications used for the generation of the workload characterized by their use ofmemory bandwidth:

High

CG class DSynthetic application with high memory bandwidth usage

Medium

CG class C

Low

CG class B

13 / 27

Page 16: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Evaluation Applications characterization and workload generation

Workload generation

The Lublin-Feitelson model was used to generate the workload.

Generates 100 jobs

Provides:

Number of cpus (2-64)Job durationJob arrival time

Combining the workload with the applications list 2 final workloads were obtained:

Medium Highhigh 54 84

medium 27 7low 19 9

The time limit used for the jobs of the workload was the calculated with theapplication running standalone.

14 / 27

Page 17: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests

Index

1 IntroductionIntroductionMotivationObjectives

2 Proposed SolutionHPerf

3 EvaluationOverhead analysisApplications characterization and workload generation

4 TestsDescriptionWorkload High

5 Conclusions

15 / 27

Page 18: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Description

Description

Both workloads (HIGH & MEDIUM) were run with 3 different scheduling policies:

ConsRes: Default Slurm scheduling policy. Considers only cpus and memory.

LessC: Less Consume policy aware of the memory bandwidth usage perapplication.

HPerf: Less Consume + Historical Perfomance data. Automatically gets thejobs resource consumption.

With empty database.With preloaded database.

For each policy tests were run with:

Unlimited grace time.

Limited grace time. Kills the jobs after 5 minutes of the time limit.

16 / 27

Page 19: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

HIGH: Unlimited time

CR

HPerf DB

17 / 27

Page 20: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

HIGH: Limited grace time

CR

HPerf DB

18 / 27

Page 21: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

Finishing status

Unlimited

Limited

19 / 27

Page 22: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

Slowdown

Unlimited

Limited

20 / 27

Page 23: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

Waiting time

Unlimited

Limited

21 / 27

Page 24: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

CPU usage

Unlimited

Limited

22 / 27

Page 25: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

Useful CPU usage

23 / 27

Page 26: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Tests Workload High

Job run time predictability

Unlimited

Limited

24 / 27

Page 27: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Conclusions

Index

1 IntroductionIntroductionMotivationObjectives

2 Proposed SolutionHPerf

3 EvaluationOverhead analysisApplications characterization and workload generation

4 TestsDescriptionWorkload High

5 Conclusions

25 / 27

Page 28: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Conclusions

Conclusions

The usage of the monitoring data for job scheduling results in betterallocation that:

Improves the applications performance.Increases the useful cpu usage.Reduces the shared resources overload.Increases the waiting time.

Avoids users providing information they may not know.

Requires better mechanisms to measure the memory bandwidth to increasethe solution performance.

26 / 27

Page 29: Improving HPC applications scheduling with predictions ... · Introduction Introduction Introduction Previous work exist to solve the memory bandwidth management issue. The PhD thesis

Improving HPC applications scheduling with predictionsbased on automatically-collected historical data

Carlos Fenoy Garcı[email protected]

September 2014