power management for chip-level multiprocessing processors

04/19/23 1

Power Management for Chip-level Multiprocessing Processors

Kai Ma

Background

To get better performance:

1. Scale frequency (fast)

2. On-chip resource replication (parallel): Chip MultiProcessing (CMP) vs. Simultaneous MultiThreading (SMT)

SMT vs CMP

                 SMT                                CMP
Technique        Duplicate resources on one core    Duplicate cores on one die
Target           Instruction-level parallelism      Thread-level parallelism
Implementation   Basically a redesign               Reuse a proven design
Area             Small transistor increase          Grows in proportion to the number of cores

Other justification for CMP

Memory wall, ILP wall, power wall

Cache coherency circuitry can operate at a higher rate

Signal integrity

Future: many cores (many specialized cores)

Power management for CMP

Reduce operating costs for energy and cooling

Prolong battery life for portable and embedded systems

Reduce cooling requirements

Meet scalable performance targets

Heat dissipation and hotspots

Outline

1. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget

Canturk Isci*, Alper Buyuktosunoglu*, Chen-Yong Cher*, Pradip Bose* and Margaret Martonosi

*IBM T.J. Watson Research Center, Yorktown Heights

Department of Electrical Engineering, Princeton University

2. Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors

Radu Teodorescu and Josep Torrellas

Department of Computer Science, University of Illinois at Urbana-Champaign

Outline

An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget

1. Contribution

2. Global Power Management

3. Global Power Management Policies: core modes, power and performance matrix

4. Experimental Result and Evaluation

5. Conclusion

6. Critique

Contribution

Introduce global power management

Develop a static power management analysis tool

Evaluate different policies for CMP power management

Global Power Management

Monitor the power of each core and set its operating mode

Global Power Management Policies

Priority: Slow down the cores that run low-priority tasks

PullHiPushLo: Speed up the low-power core and slow down the high-power core

MaxBIPS: Predict the power and BIPS of each power mode combination and choose the combination with the highest BIPS that fits the power budget
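
A minimal sketch of one PullHiPushLo balancing step. Several details are assumptions rather than part of the slide: the mode ordering (Eff2 slowest, Turbo fastest), the rule that only one core changes mode per step, and the names `pull_hi_push_lo_step`, `power`, and `budget`.

```python
MODES = ["Eff2", "Eff1", "Turbo"]  # assumed ordering, slowest to fastest


def pull_hi_push_lo_step(modes, power, budget):
    """Return a new per-core mode list after one balancing step.

    modes  -- current mode name for each core
    power  -- measured power for each core (hypothetical units)
    budget -- chip-wide power budget (same units)
    """
    new_modes = list(modes)
    total = sum(power)
    if total > budget:
        # Over budget: slow down the highest-power core that can still slow down.
        candidates = [i for i, m in enumerate(modes) if m != MODES[0]]
        if candidates:
            hi = max(candidates, key=lambda i: power[i])
            new_modes[hi] = MODES[MODES.index(modes[hi]) - 1]
    elif total < budget:
        # Slack available: speed up the lowest-power core that can still speed up.
        # This sketch does not check that the faster mode still fits the budget.
        candidates = [i for i, m in enumerate(modes) if m != MODES[-1]]
        if candidates:
            lo = min(candidates, key=lambda i: power[i])
            new_modes[lo] = MODES[MODES.index(modes[lo]) + 1]
    return new_modes
```

Calling the function once per monitoring interval would nudge the chip toward the budget one core at a time.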

Core Power Modes

Underlying mechanism: DVFS

Overhead: on the order of microseconds

Performance degradation: elapsed execution time for the benchmark

Power and BIPS Matrices

Mode     Power       BIPS
Turbo    1           1*(500/507)
Eff1     1*0.95^3    1*0.95*(500/513)
Eff2     1*0.85^3    1*0.85*(500/520)
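
The matrix entries above can drive a MaxBIPS-style selection. The sketch below assumes the listed factors scale each core's Turbo-mode power and BIPS, and that every mode combination is scored exhaustively; `max_bips`, `turbo_power`, `turbo_bips`, and `budget` are hypothetical names and inputs.

```python
from itertools import product

# Scaling factors copied from the Power and BIPS matrices, relative to Turbo.
POWER_SCALE = {"Turbo": 1.0, "Eff1": 0.95 ** 3, "Eff2": 0.85 ** 3}
BIPS_SCALE = {
    "Turbo": 1.0 * (500 / 507),
    "Eff1": 0.95 * (500 / 513),
    "Eff2": 0.85 * (500 / 520),
}


def max_bips(turbo_power, turbo_bips, budget):
    """Pick the mode combination with the highest predicted BIPS under budget.

    turbo_power / turbo_bips -- hypothetical per-core values at Turbo
    budget                   -- chip-wide power budget
    Returns (modes, predicted_bips, predicted_power), or None if nothing fits.
    """
    best = None
    for combo in product(POWER_SCALE, repeat=len(turbo_power)):
        power = sum(p * POWER_SCALE[m] for p, m in zip(turbo_power, combo))
        if power > budget:
            continue  # this combination exceeds the power budget
        bips = sum(b * BIPS_SCALE[m] for b, m in zip(turbo_bips, combo))
        if best is None or bips > best[1]:
            best = (combo, bips, power)
    return best
```

For example, `max_bips([10, 10], [2.0, 1.0], 18)` scores all nine two-core combinations and returns the feasible one with the highest predicted BIPS.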

Experimental Methodology

SPEC CPU2000 benchmarks

A trace-based CMP analysis tool is integrated with IBM’s Turandot simulator

Mode switch (500ns) and statistics collection (50ns)

During a mode switch, no instructions execute, but power is still consumed

Static vs Dynamic

Policy and Budget Curve

Power Saving

Power Management Result

Trends under CMP Scaling

The gap between MaxBIPS and the oracle shrinks as the number of cores increases

Increasing the number of cores has a smaller impact on MaxBIPS

CMP scaling favors static per-core management over chip-wide DVFS

Conclusion

Global management is preferred

Dynamic management is preferred

MaxBIPS is efficient

Critique

MaxBIPS: prediction cost grows superlinearly with the number of modes and cores

Power/performance estimation matrices: transition penalty

Does not consider temperature
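
To make the first critique point concrete, the sketch below counts the per-core mode assignments that would have to be scored, assuming MaxBIPS evaluates every combination exhaustively (an assumption; `mode_combinations` is a hypothetical helper).

```python
def mode_combinations(num_modes, num_cores):
    """Number of per-core mode assignments an exhaustive search must score."""
    return num_modes ** num_cores


# Three modes (Turbo, Eff1, Eff2) on a growing number of cores.
for cores in (2, 4, 8, 16):
    print(cores, mode_combinations(3, cores))
```

The count grows exponentially in the number of cores, which is why the prediction step becomes expensive on larger CMPs.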

Outline

Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors

1. Background

2. Contribution

3. Algorithm

4. System Implementation

5. Evaluation

6. Conclusion

7. Critique

Background

For CMPs, within-die process variation impacts:

Static power consumption

Maximum frequency

Contribution

Propose variation-aware algorithms for application scheduling

Complement these algorithms with variation-aware DVFS

CMP Configuration

High level frequency and DVFS policy

Algorithms

Linear Programming

A technique for optimization of a linear objective function, subject to linear equality and linear inequality constraints

c and b are known vectors, A is a known matrix, and x is the vector of variables
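
With these symbols, a linear program can be written in the standard textbook form below. This canonical form is assumed here; other equivalent forms (for example with equality constraints) exist.

```latex
\begin{aligned}
\text{maximize}\quad   & c^{T} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \ge 0
\end{aligned}
```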

Power Mode Selection: LinOpt

TP: average throughput

N: number of cores

i: core index, from 1 to N

a(i): constant that depends on the thread and core

v(i): core voltage

b(i) and c(i): constants introduced to approximate the power-voltage relation

Objective function:

Constraints:
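
A minimal sketch of how such a linear program could be solved. It assumes, based only on the symbol definitions above, that the objective is to maximize the sum of a(i)*v(i) and that the constraints are a chip-wide budget on the sum of b(i)*v(i) + c(i) plus a voltage range for each core. The budget `p_target`, the bounds `v_min` and `v_max`, and the function name are hypothetical. Because this assumed form has a single budget constraint with box bounds, a greedy fill is used in place of a general LP solver.

```python
def linopt_sketch(a, b, c, p_target, v_min, v_max):
    """Solve: maximize sum(a[i]*v[i])
    subject to sum(b[i]*v[i] + c[i]) <= p_target and v_min <= v[i] <= v_max.

    Assumes a[i] > 0 and b[i] > 0. With one budget constraint and box bounds
    the LP is a continuous knapsack, so a greedy fill by a[i]/b[i] is optimal.
    Returns the voltage list, or None if v_min on every core is over budget.
    """
    n = len(a)
    v = [v_min] * n
    slack = p_target - sum(b[i] * v_min + c[i] for i in range(n))
    if slack < 0:
        return None
    # Raise voltage first on the cores with the most throughput per unit power.
    for i in sorted(range(n), key=lambda i: a[i] / b[i], reverse=True):
        dv = min(v_max - v_min, slack / b[i])
        v[i] += dv
        slack -= b[i] * dv
    return v
```

The returned voltages are continuous; mapping them to the discrete levels a real DVFS controller supports would be an additional step.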

Power Mode Selection: SAnn

Use a simulated annealing algorithm to solve the power mode selection problem

SAnn searches the space of possible core voltage combinations

Compared to LinOpt: more accurate but more costly
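
A minimal simulated annealing sketch for the same selection problem. The linear throughput and power models, the discrete voltage `levels`, the linear cooling schedule, and all parameter names are assumptions made for illustration, not the configuration used in the evaluation.

```python
import math
import random


def sann_sketch(a, b, c, p_target, levels, steps=5000, t0=1.0, seed=0):
    """Simulated annealing over discrete per-core voltage levels.

    Hypothetical models: throughput = sum(a[i]*v[i]),
    power = sum(b[i]*v[i] + c[i]). The search starts from the lowest level on
    every core, which is assumed to fit within p_target.
    """
    rng = random.Random(seed)
    n = len(a)

    def throughput(v):
        return sum(a[i] * v[i] for i in range(n))

    def power(v):
        return sum(b[i] * v[i] + c[i] for i in range(n))

    current = [min(levels)] * n
    best = list(current)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling schedule
        cand = list(current)
        cand[rng.randrange(n)] = rng.choice(levels)  # perturb one core
        if power(cand) > p_target:
            continue  # reject moves that break the power budget
        delta = throughput(cand) - throughput(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = cand
            if throughput(current) > throughput(best):
                best = list(current)
    return best
```

The many candidate evaluations per run illustrate why this kind of search costs more than solving a small linear program.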

System Implementation

The algorithm runs on a core or on a power management unit

At each OS scheduling interval, the OS assigns threads to cores using VarF&AppIPC

Every 10ms, the LinOpt algorithm runs and sets the cores to correct power

Profiling for Implementation

Evaluation Methodology

Variation: VARIUS model

Power: SESC + Wattch + HotLeakage

Temperature: HotSpot

Critical path model:

1. Calculation path delay: multiplier-like unit

2. Memory: SRAM

3. Interconnection: CACTI

4. Gate delay: alpha-power law
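
For reference, the alpha-power law is commonly written as below, where T_g is the gate delay, V the supply voltage, V_th the threshold voltage, and alpha a technology-dependent exponent. The exact variant and parameter values used in the evaluation are not given here, so this generic form is an assumption.

```latex
T_g \propto \frac{V}{(V - V_{th})^{\alpha}}
```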

Workload

SPEC

Run different applications on different cores

12 billion instructions

Metrics

Total power

Average frequency of active cores

Throughput

Energy-delay-squared product (considers time-to-solution and energy consumption)

Weighted throughput: each application’s IPC normalized to that application’s IPC at reference conditions
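
The last two metrics can be sketched as small helpers. Summing the normalized IPCs across applications is an assumption about how the per-application values are combined; the function and parameter names are hypothetical.

```python
def ed2(energy_joules, delay_seconds):
    """Energy-delay-squared product: E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def weighted_throughput(ipc, ipc_ref):
    """Sum of each application's IPC normalized to its reference IPC."""
    return sum(x / r for x, r in zip(ipc, ipc_ref))
```

Squaring the delay weights time-to-solution more heavily than energy, so a configuration that saves energy by running much slower is penalized.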

Evaluation

Power and frequency variation on one die

Uniform Frequency & No DVFS

As the number of threads increases, fewer unused cores remain to choose from during thread mapping

Non-Uniform Frequency & No DVFS

Different cores run at different frequencies; a mapping that simply selects the less-used cores may end up on lower-frequency ones.

NoUniFreq+DVFS

Throughput:

VarF&AppIPC+LinOpt is effective

Power:

throughput gains are high when power targets are low

LinOpt Granularity

The deviation between the power consumed and the power target decreases as the interval between LinOpt runs increases

Conclusion

Within-die variation substantially impacts static power consumption and maximum frequency

Variation-aware algorithms are proposed and analyzed; LinOpt is efficient

Critique

How to decouple thread mapping and power mode selection

Static power consumption and dynamic power consumption should be discussed separately

Thread mapping takes place only once; thread migration should be considered

Comparison

              MICRO                                UIUC
Objective     Manage CMP processor power           Manage CMP processor power
Core          8 homogeneous cores (POWER4-like)    20 inhomogeneous cores (Alpha 21264-like)
Policy        Global, dynamic                      Global, dynamic
Algorithm     MaxBIPS                              LinOpt
Methodology   Simulation (Turandot)                Simulation (SESC)
Benchmark     SPEC CPU2000                         SPEC
Controller    Ad hoc                               Ad hoc
