heteropar-accurate-emul-cpu-slides.pdf - hal.inria.fr · validation of distributed systems...

23
Accurate emulation of CPU performance Tomasz Buchert 1 Lucas Nussbaum 2 Jens Gustedt 1 1 INRIA Nancy – Grand Est 2 LORIA / Nancy - Universit´ e

Upload: vudat

Post on 08-Oct-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Accurate emulation of CPU performance

Tomasz Buchert1 Lucas Nussbaum2 Jens Gustedt1

1 INRIA Nancy – Grand Est

2 LORIA / Nancy - Universite

Validation of distributed systems

Approaches:Theoretical approach (paper and pencil)

, the most general results and understanding/ very hard (leads to unsolvability results)

Experimentation (real application on a real environment), realistic context, credibility/ difficulty of preparation and control, questionable reproducibility

Simulation (modeled application inside modeled environment), very simple and perfectly reproducible/ experimental bias, possibly unrealistic

Emulation (real application inside a modeled environment), control over the experiment parameters/ difficult

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 2 / 20

Emulation

The perfect emulated environment should emulate (independently):

Network bandwidth, latency, topologyPerformance and number of CPUsMemory capabilitiesBackground noise (network, CPU, faults)

Already implemented in Wrekavoc – a tool to define and controlheterogeneity of the cluster (but not perfect yet!)

In this talk, however, we specifically concentrate on

Emulation of CPU

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 3 / 20

Emulation

The perfect emulated environment should emulate (independently):

Network bandwidth, latency, topologyPerformance and number of CPUsMemory capabilitiesBackground noise (network, CPU, faults)

Already implemented in Wrekavoc – a tool to define and controlheterogeneity of the cluster (but not perfect yet!)

In this talk, however, we specifically concentrate on

Emulation of CPU

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 3 / 20

CPU emulation

Various elements of CPU architecture could be emulated:

speednumber of coressizes and properties of caches (and topology thereof)memory access speed (especially for NUMA systems)

In this talk, we will talk about

Degradation of CPU speed

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 4 / 20

CPU emulation

Various elements of CPU architecture could be emulated:

speednumber of coressizes and properties of caches (and topology thereof)memory access speed (especially for NUMA systems)

In this talk, we will talk about

Degradation of CPU speed

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 4 / 20

An example

50 %

Unused

CPU 1

50 %

Unused

CPU 2

70 %

Unused

CPU 3

30 %

Unused

CPU 4

(1) controlling speed of each CPU/core independently

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 5 / 20

An example (continued)

50 %

Unused

CPU 1

50 %

Unused

CPU 2

70 %

Unused

CPU 3

30 %

Unused

CPU 4

(2) being able to create separated scheduling zones

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 6 / 20

Dynamic frequency scaling (CPU-Freq)

AKA Intel Enhanced SpeedStep or AMD Cool’n’QuietHardware solution to reduce:

heatnoisepower usage

For:no overhead of emulationcompletely unintrusivemeaningful CPU time measure

Against:only a finite set of different frequency levels

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 7 / 20

CPU-Lim

Method available in WrekavocAlgorithm:

if CPU usage ≥ threshold → send SIGSTOP to the processif CPU usage < threshold → send SIGCONT to the process

CPU usage = CPU time of the processprocess lifetime

For:easy and almost POSIX-compliant

Against:intrusive and unscalabledecision based on one process instead of global CPU usagesleeping is indistinguishable from preemption

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 8 / 20

Fracas

Based on idea from KRASH (load injection tool) ideaUses Linux Cgroups and Completely Fair SchedulerA predefined portion of the CPU is given to tasks burning CPUAll other processes are given the remaining CPU time

Emulatedprocesses

CPU burner

Core 1

Emulatedprocesses

CPU burner

Core 2

Emulatedprocesses

CPU burner

Core 3

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 9 / 20

Fracas

Based on idea from KRASH (load injection tool) ideaUses Linux Cgroups and Completely Fair SchedulerA predefined portion of the CPU is given to tasks burning CPUAll other processes are given the remaining CPU time

For:unintrusivescalable

Against:unportable to other systemssensitive to the configuration of the scheduler

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 9 / 20

Fracas and latency of the scheduler

1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

3.03.54.04.55.05.56.06.57.0

GFL

OP

/ s

0.1 ms1 ms10 ms100 ms1000 ms

The smaller the latency, the better the emulation

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 10 / 20

Evaluation

Based on different types of work:CPU intensive (Linpack benchmark)IO boundmultiprocessingmultithreadingmemory speed (STREAM benchmark)

X-axis – emulated frequencyY-axis – speed perceived by the benchmarkeach test repeated 10 times, results = average95% confidence interval using t-Student distributionEvaluation performed on Grid’5000 platform

nodes with two quad-core Intel Xeon X5570 processorsnodes with a pair of single-core AMD Opteron 252 processors

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 11 / 20

Grid’5000

9 sites, 1600 machinesLille, Rennes, Orsay, Nancy, Bordeaux, Lyon, Grenoble, Toulouse, Sophia

Dedicated to research on distributed systems and HPC

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 12 / 20

CPU intensive work

1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

3.03.54.04.55.05.56.06.5

GFL

OP

/ s

CPU-FreqCPU-Lim1Fracas

CPU-Lim is less predictable (the outcome has higher variance)

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 13 / 20

IO-bound work

1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

3500

4000

4500

5000

5500

6000Lo

ops

/ s

CPU-FreqCPU-Lim1Fracas

CPU-Lim gives (unfair) advantage to IO-bound tasks

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 14 / 20

Multiprocessing

1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

2000

4000

6000

8000

10000Lo

ops

/ sCPU-FreqCPU-Lim1Fracas

Fracas can’t emulate CPU for multitask computation

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 15 / 20

Multithreading

1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

2000

4000

6000

8000

10000Lo

ops

/ sCPU-FreqCPU-Lim1Fracas

CPU-Lim controls processes instead of scheduling entities

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 16 / 20

Memory speed

1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]

5000

6000

7000

8000

9000

10000

11000G

B / s

CPU-FreqCPU-Lim1Fracas

Memory speed is affected differently by each method

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 17 / 20

Summary of the evaluation

CPU-Freq:very good resultscoarse granularity

CPU-Lim:not scalable due to implementation, intrusivehigher variancecontrols processes, not threads

Fracas:good behavior for a single-task workloadscalablebad behavior for multitask workload

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 18 / 20

Future work

Explore other approachesImprove Fracas to cover multitaskingEmulate memory bandwidthEmulate other aspects of CPUIntegrate Fracas into WrekavocTake over the world :)

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 19 / 20

Conclusions

Presented Fracas, a method for CPU performance emulationbased on Linux cgroupsCompared with CPU-Freq and CPU-Lim (Wrekavoc)Evaluated experimentally on Grid’5000None of the methods is perfect:

CPU-Freq: coarse grainedCPU-Lim: implementation problems, not scalableFracas: works perfectly in single thread/process case, needs workin multithread/process case

Questions?

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 20 / 20